Claire Vandiedonck & Sandrine Caburet
Open this notebook in a folder "GB_5".
Tutorial / Practical
In this tutorial/parctical, we will go on in our analysis of transcriptomic data from T1D patients and their related unaffected controls ("Fil Rouge" UE5 and Practical Sessions in UE1).
We will :
To run this analysis, we will use different packages among which:
# Code cell n°1
setwd("~/meg_m1_geno_bioinfo/GB_5")
getwd()
myworking_path <- "./" # modify as necessary
datapath <- "/srv/data/meg-m1-gb/DGE_results/"
# if necessary, a local directory is created for the local installation of packages, and added to .libPaths
if (length(.libPaths() == 1)) {
dir.create("~/R/x86_64-conda-linux-gnu-library/4.1")
.libPaths(c("/srv/conda/envs/notebook/lib/R/library", "~/R/x86_64-conda-linux-gnu-library/4.1"))
}
Warning message in dir.create("~/R/x86_64-conda-linux-gnu-library/4.1"):
“'/srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1' already exists”
We install the required libraries, using a conditionnal execution, in order to avoid downloading packages that are already installed in the environment.
The rtracklayer is not installed in this environment on adenine. It will thus be installed the first time you are running this notebook. It may take a few minutes, be patient.
# Code cell n°2
# list the required libraries
requiredLib <- c("tidyverse",
"statmod",
"dendextend",
"stats",
"grDevices",
"ggnewscale",
"BiocManager",
"RColorBrewer",
"writexl",
"readxl",
"corrplot",
"GGally",
"ggcorrplot",
"ppcor",
"ggpubr",
"ggrepel")
requiredBiocLib <- c("limma",
"gplots",
"org.Hs.eg.db",
"clusterProfiler",
"enrichplot",
"ComplexHeatmap",
"GeneBreak",
"biomaRt",
"GenomicRanges",
"XVector",
"rtracklayer")
# install required libraries if not
for (lib in requiredLib) {
if (!require(lib, character.only = TRUE, quiet = TRUE)) {
install.packages(lib, quiet = TRUE, lib = "~/R/x86_64-conda-linux-gnu-library/4.1")
}
}
for( lib in requiredBiocLib) {
if (!require(lib, character.only = TRUE, quiet = TRUE)) {
BiocManager::install(lib, quiet = TRUE, update = FALSE, lib = "~/R/x86_64-conda-linux-gnu-library/4.1")
}
}
Warning message: “Failed to locate timezone database” ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ── ✔ ggplot2 3.4.0 ✔ purrr 0.3.5 ✔ tibble 3.1.8 ✔ dplyr 1.0.10 ✔ tidyr 1.2.1 ✔ stringr 1.4.1 ✔ readr 2.1.3 ✔ forcats 0.5.2 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() --------------------- Welcome to dendextend version 1.16.0 Type citation('dendextend') for how to cite the package. Type browseVignettes(package = 'dendextend') for the package vignette. The github page is: https://github.com/talgalili/dendextend/ Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues You may ask questions at stackoverflow, use the r and dendextend tags: https://stackoverflow.com/questions/tagged/dendextend To suppress this message use: suppressPackageStartupMessages(library(dendextend)) --------------------- Attaching package: ‘dendextend’ The following object is masked from ‘package:stats’: cutree Bioconductor version '3.14' is out-of-date; the current release version '3.16' is available with R version '4.2'; see https://bioconductor.org/install corrplot 0.92 loaded Registered S3 method overwritten by 'GGally': method from +.gg ggplot2 Attaching package: ‘MASS’ The following object is masked from ‘package:dplyr’: select Attaching package: ‘ggpubr’ The following object is masked from ‘package:dendextend’: rotate Attaching package: ‘gplots’ The following object is masked from ‘package:stats’: lowess Attaching package: ‘BiocGenerics’ The following object is masked from ‘package:limma’: plotMA The following objects are masked from ‘package:dplyr’: combine, intersect, setdiff, union The following objects are masked from ‘package:stats’: IQR, mad, sd, var, xtabs The following objects are masked from ‘package:base’: anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply, union, unique, unsplit, which.max, which.min Welcome to Bioconductor Vignettes contain introductory material; view with 'browseVignettes()'. To cite Bioconductor, see 'citation("Biobase")', and for packages 'citation("pkgname")'. Attaching package: ‘S4Vectors’ The following object is masked from ‘package:gplots’: space The following objects are masked from ‘package:dplyr’: first, rename The following object is masked from ‘package:tidyr’: expand The following objects are masked from ‘package:base’: expand.grid, I, unname Attaching package: ‘IRanges’ The following objects are masked from ‘package:dplyr’: collapse, desc, slice The following object is masked from ‘package:purrr’: reduce Attaching package: ‘AnnotationDbi’ The following object is masked from ‘package:MASS’: select The following object is masked from ‘package:dplyr’: select clusterProfiler v4.2.2 For help: https://yulab-smu.top/biomedical-knowledge-mining-book/ If you use clusterProfiler in published research, please cite: T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141 Attaching package: ‘clusterProfiler’ The following object is masked from ‘package:AnnotationDbi’: select The following object is masked from ‘package:IRanges’: slice The following object is masked from ‘package:S4Vectors’: rename The following object is masked from ‘package:MASS’: select The following object is masked from ‘package:purrr’: simplify The following object is masked from ‘package:stats’: filter Attaching package: ‘enrichplot’ The following object is masked from ‘package:ggpubr’: color_palette The following object is masked from ‘package:GGally’: ggtable ======================================== ComplexHeatmap version 2.10.0 Bioconductor page: http://bioconductor.org/packages/ComplexHeatmap/ Github page: https://github.com/jokergoo/ComplexHeatmap Documentation: http://jokergoo.github.io/ComplexHeatmap-reference If you use it in published research, please cite: Gu, Z. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 2016. The new InteractiveComplexHeatmap package can directly export static complex heatmaps into an interactive Shiny app with zero effort. Have a try! This message can be suppressed by: suppressPackageStartupMessages(library(ComplexHeatmap)) ======================================== Attaching package: ‘CGHcall’ The following object is masked from ‘package:BiocGenerics’: normalize Attaching package: ‘GeneBreak’ The following object is masked from ‘package:CGHcall’: segmentData The following object is masked from ‘package:Biobase’: sampleNames Attaching package: ‘XVector’ The following object is masked from ‘package:purrr’: compact 'getOption("repos")' replaces Bioconductor standard repositories, see '?repositories' for details replacement repositories: CRAN: https://cran.r-project.org Bioconductor version 3.14 (BiocManager 1.30.19), R 4.1.3 (2022-03-10) Installing package(s) 'rtracklayer' also installing the dependencies ‘MatrixGenerics’, ‘DelayedArray’, ‘SummarizedExperiment’, ‘GenomicAlignments’, ‘BiocIO’, ‘restfulr’
We verify the packages that are installed and available for loading.
# Code cell n°3
installed.packages()[,c(1,2,3)]
| Package | LibPath | Version | |
|---|---|---|---|
| abind | abind | /srv/conda/envs/notebook/lib/R/library | 1.4-5 |
| AnnotationDbi | AnnotationDbi | /srv/conda/envs/notebook/lib/R/library | 1.56.2 |
| ape | ape | /srv/conda/envs/notebook/lib/R/library | 5.6-2 |
| aplot | aplot | /srv/conda/envs/notebook/lib/R/library | 0.1.8 |
| askpass | askpass | /srv/conda/envs/notebook/lib/R/library | 1.1 |
| assertthat | assertthat | /srv/conda/envs/notebook/lib/R/library | 0.2.1 |
| backports | backports | /srv/conda/envs/notebook/lib/R/library | 1.4.1 |
| base | base | /srv/conda/envs/notebook/lib/R/library | 4.1.3 |
| base64enc | base64enc | /srv/conda/envs/notebook/lib/R/library | 0.1-3 |
| BH | BH | /srv/conda/envs/notebook/lib/R/library | 1.78.0-0 |
| Biobase | Biobase | /srv/conda/envs/notebook/lib/R/library | 2.54.0 |
| BiocFileCache | BiocFileCache | /srv/conda/envs/notebook/lib/R/library | 2.2.1 |
| BiocGenerics | BiocGenerics | /srv/conda/envs/notebook/lib/R/library | 0.40.0 |
| BiocManager | BiocManager | /srv/conda/envs/notebook/lib/R/library | 1.30.19 |
| BiocParallel | BiocParallel | /srv/conda/envs/notebook/lib/R/library | 1.28.3 |
| BiocVersion | BiocVersion | /srv/conda/envs/notebook/lib/R/library | 3.14.0 |
| biomaRt | biomaRt | /srv/conda/envs/notebook/lib/R/library | 2.50.3 |
| Biostrings | Biostrings | /srv/conda/envs/notebook/lib/R/library | 2.62.0 |
| bit | bit | /srv/conda/envs/notebook/lib/R/library | 4.0.4 |
| bit64 | bit64 | /srv/conda/envs/notebook/lib/R/library | 4.0.5 |
| bitops | bitops | /srv/conda/envs/notebook/lib/R/library | 1.0-7 |
| blob | blob | /srv/conda/envs/notebook/lib/R/library | 1.2.3 |
| boot | boot | /srv/conda/envs/notebook/lib/R/library | 1.3-28 |
| brew | brew | /srv/conda/envs/notebook/lib/R/library | 1.0-8 |
| brio | brio | /srv/conda/envs/notebook/lib/R/library | 1.1.3 |
| broom | broom | /srv/conda/envs/notebook/lib/R/library | 1.0.1 |
| bslib | bslib | /srv/conda/envs/notebook/lib/R/library | 0.4.1 |
| cachem | cachem | /srv/conda/envs/notebook/lib/R/library | 1.0.6 |
| callr | callr | /srv/conda/envs/notebook/lib/R/library | 3.7.3 |
| car | car | /srv/conda/envs/notebook/lib/R/library | 3.1-1 |
| ⋮ | ⋮ | ⋮ | ⋮ |
| usethis | usethis | /srv/conda/envs/notebook/lib/R/library | 2.1.6 |
| utf8 | utf8 | /srv/conda/envs/notebook/lib/R/library | 1.2.2 |
| utils | utils | /srv/conda/envs/notebook/lib/R/library | 4.1.3 |
| uuid | uuid | /srv/conda/envs/notebook/lib/R/library | 1.1-0 |
| vctrs | vctrs | /srv/conda/envs/notebook/lib/R/library | 0.5.0 |
| viridis | viridis | /srv/conda/envs/notebook/lib/R/library | 0.6.2 |
| viridisLite | viridisLite | /srv/conda/envs/notebook/lib/R/library | 0.4.1 |
| vroom | vroom | /srv/conda/envs/notebook/lib/R/library | 1.6.0 |
| waldo | waldo | /srv/conda/envs/notebook/lib/R/library | 0.4.0 |
| whisker | whisker | /srv/conda/envs/notebook/lib/R/library | 0.4 |
| withr | withr | /srv/conda/envs/notebook/lib/R/library | 2.5.0 |
| writexl | writexl | /srv/conda/envs/notebook/lib/R/library | 1.4.1 |
| WriteXLS | WriteXLS | /srv/conda/envs/notebook/lib/R/library | 6.4.0 |
| xfun | xfun | /srv/conda/envs/notebook/lib/R/library | 0.34 |
| XML | XML | /srv/conda/envs/notebook/lib/R/library | 3.99-0.12 |
| xml2 | xml2 | /srv/conda/envs/notebook/lib/R/library | 1.3.3 |
| xopen | xopen | /srv/conda/envs/notebook/lib/R/library | 1.0.0 |
| xtable | xtable | /srv/conda/envs/notebook/lib/R/library | 1.8-4 |
| XVector | XVector | /srv/conda/envs/notebook/lib/R/library | 0.34.0 |
| yaml | yaml | /srv/conda/envs/notebook/lib/R/library | 2.3.6 |
| yulab.utils | yulab.utils | /srv/conda/envs/notebook/lib/R/library | 0.0.5 |
| zip | zip | /srv/conda/envs/notebook/lib/R/library | 2.2.2 |
| zlibbioc | zlibbioc | /srv/conda/envs/notebook/lib/R/library | 1.40.0 |
| BiocIO | BiocIO | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 1.4.0 |
| DelayedArray | DelayedArray | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 0.20.0 |
| GenomicAlignments | GenomicAlignments | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 1.30.0 |
| MatrixGenerics | MatrixGenerics | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 1.6.0 |
| restfulr | restfulr | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 0.0.15 |
| rtracklayer | rtracklayer | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 1.54.0 |
| SummarizedExperiment | SummarizedExperiment | /srv/home/scaburet/R/x86_64-conda-linux-gnu-library/4.1 | 1.24.0 |
And we load all the librairies in the current session:
# Code cell n°4
# load libraries
message("Loading required libraries")
for (lib in requiredLib) {
library(lib, character.only = TRUE)}
for (lib in requiredBiocLib) {
library(lib, character.only = TRUE)}
library(rtracklayer)
Loading required libraries
Finally, we check that all the required librairies were correctly loaded in the current session, and we remove the temporary objects created in this section.
# Code cell n°5
# keep tracks of R and packages for this study
sessionInfo()
rm(lib, requiredLib, requiredBiocLib)
R version 4.1.3 (2022-03-10) Platform: x86_64-conda-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS/LAPACK: /srv/conda/envs/notebook/lib/libopenblasp-r0.3.21.so locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] grid stats4 stats graphics grDevices utils datasets [8] methods base other attached packages: [1] rtracklayer_1.54.0 XVector_0.34.0 biomaRt_2.50.3 [4] GeneBreak_1.24.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 [7] CGHcall_2.56.0 snowfall_1.84-6.2 snow_0.4-4 [10] CGHbase_1.54.0 marray_1.72.0 DNAcopy_1.68.0 [13] impute_1.68.0 QDNAseq_1.30.0 ComplexHeatmap_2.10.0 [16] enrichplot_1.14.2 clusterProfiler_4.2.2 org.Hs.eg.db_3.14.0 [19] AnnotationDbi_1.56.2 IRanges_2.28.0 S4Vectors_0.32.4 [22] Biobase_2.54.0 BiocGenerics_0.40.0 gplots_3.1.3 [25] limma_3.50.3 ggrepel_0.9.2 ggpubr_0.4.0 [28] ppcor_1.1 MASS_7.3-58.1 ggcorrplot_0.1.4 [31] GGally_2.1.2 corrplot_0.92 readxl_1.4.1 [34] writexl_1.4.1 RColorBrewer_1.1-3 BiocManager_1.30.19 [37] ggnewscale_0.4.8 dendextend_1.16.0 statmod_1.4.37 [40] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 [43] purrr_0.3.5 readr_2.1.3 tidyr_1.2.1 [46] tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2 loaded via a namespace (and not attached): [1] utf8_1.2.2 R.utils_2.12.2 [3] tidyselect_1.2.0 RSQLite_2.2.18 [5] BiocParallel_1.28.3 scatterpie_0.1.8 [7] munsell_0.5.0 codetools_0.2-18 [9] future_1.29.0 pbdZMQ_0.3-8 [11] withr_2.5.0 colorspace_2.0-3 [13] GOSemSim_2.20.0 filelock_1.0.2 [15] uuid_1.1-0 ggsignif_0.6.4 [17] DOSE_3.20.1 listenv_0.8.0 [19] MatrixGenerics_1.6.0 repr_1.1.4 [21] GenomeInfoDbData_1.2.7 polyclip_1.10-4 [23] bit64_4.0.5 farver_2.1.1 [25] downloader_0.4 parallelly_1.32.1 [27] vctrs_0.5.0 treeio_1.23.0 [29] generics_0.1.3 BiocFileCache_2.2.1 [31] R6_2.5.1 doParallel_1.0.17 [33] clue_0.3-62 graphlayouts_0.8.3 [35] DelayedArray_0.20.0 bitops_1.0-7 [37] cachem_1.0.6 reshape_0.8.9 [39] fgsea_1.20.0 gridGraphics_0.5-1 [41] assertthat_0.2.1 BiocIO_1.4.0 [43] scales_1.2.1 ggraph_2.1.0 [45] googlesheets4_1.0.1 gtable_0.3.1 [47] globals_0.16.1 tidygraph_1.2.2 [49] rlang_1.0.6 GlobalOptions_0.1.2 [51] splines_4.1.3 rstatix_0.7.1 [53] lazyeval_0.2.2 gargle_1.2.1 [55] broom_1.0.1 yaml_2.3.6 [57] reshape2_1.4.4 abind_1.4-5 [59] modelr_0.1.10 backports_1.4.1 [61] qvalue_2.26.0 tools_4.1.3 [63] ggplotify_0.1.0 ellipsis_0.3.2 [65] Rcpp_1.0.9 plyr_1.8.8 [67] progress_1.2.2 base64enc_0.1-3 [69] zlibbioc_1.40.0 RCurl_1.98-1.9 [71] prettyunits_1.1.1 GetoptLong_1.0.5 [73] viridis_0.6.2 SummarizedExperiment_1.24.0 [75] haven_2.5.1 cluster_2.1.4 [77] fs_1.5.2 magrittr_2.0.3 [79] data.table_1.14.4 DO.db_2.9 [81] circlize_0.4.15 reprex_2.0.2 [83] googledrive_2.0.0 matrixStats_0.62.0 [85] hms_1.1.2 patchwork_1.1.2 [87] evaluate_0.18 XML_3.99-0.12 [89] gridExtra_2.3 shape_1.4.6 [91] compiler_4.1.3 KernSmooth_2.23-20 [93] crayon_1.5.2 shadowtext_0.1.2 [95] R.oo_1.25.0 htmltools_0.5.3 [97] ggfun_0.0.8 tzdb_0.3.0 [99] aplot_0.1.8 lubridate_1.8.0 [101] DBI_1.1.3 tweenr_2.0.2 [103] dbplyr_2.2.1 rappdirs_0.3.3 [105] Matrix_1.5-3 car_3.1-1 [107] cli_3.4.1 R.methodsS3_1.8.2 [109] parallel_4.1.3 igraph_1.3.5 [111] pkgconfig_2.0.3 GenomicAlignments_1.30.0 [113] IRdisplay_1.1 xml2_1.3.3 [115] foreach_1.5.2 ggtree_3.7.1.001 [117] rvest_1.0.3 yulab.utils_0.0.5 [119] digest_0.6.30 Biostrings_2.62.0 [121] cellranger_1.1.0 fastmatch_1.1-3 [123] tidytree_0.4.1 restfulr_0.0.15 [125] curl_4.3.3 Rsamtools_2.10.0 [127] gtools_3.9.3 rjson_0.2.21 [129] lifecycle_1.0.3 nlme_3.1-160 [131] jsonlite_1.8.3 carData_3.0-5 [133] viridisLite_0.4.1 fansi_1.0.3 [135] pillar_1.8.1 lattice_0.20-45 [137] KEGGREST_1.34.0 fastmap_1.1.0 [139] httr_1.4.4 GO.db_3.14.0 [141] glue_1.6.2 png_0.1-7 [143] iterators_1.0.14 bit_4.0.4 [145] ggforce_0.4.1 stringi_1.7.8 [147] blob_1.2.3 caTools_1.18.2 [149] memoise_2.0.1 IRkernel_1.2 [151] future.apply_1.10.0 ape_5.6-2
At the end of the Fil Rouge, we ended up with several R objects that we will reload in this R Session.
We upload them again for this session as we did during session 1.
# Code cell n°6
load(paste(datapath, "probes.RData",sep = ""))
ls()
str(probes)
gene_list <- probes$TargetID
cat("There are ", length(unique(gene_list)), "unique genes\n")
'data.frame': 47323 obs. of 9 variables: $ ProbeID : int 6450255 2570615 6370619 2600039 2650615 5340672 2000519 3870044 7050209 1580181 ... $ CHROMOSOME : chr "7" "19" "19" "10" ... $ CYTOBAND : chr "7p15.3e" "19q13.43c" "19q13.43c" "10q11.23c" ... $ PROBE_CHR_ORIENTATION: chr "-" "-" "-" "-" ... $ PROBE_COORDINATES : chr "20147187-20147236" "63548541-63548590" "63549180-63549229" "52566586-52566635" ... $ REFSEQ_ID : chr "NM_182762.2" "NM_130786.2" "NM_130786.2" "NM_138932.1" ... $ ENTREZ_GENE_ID : int 346389 1 1 29974 29974 29974 23784 23784 23784 54715 ... $ TargetID : chr "7A5" "A1BG" "A1BG" "A1CF" ... $ SYMBOL : chr "7A5" "A1BG" "A1BG" "A1CF" ... There are 34694 unique genes
# Code cell n°7
load(paste(datapath,"samples_info.RData", sep = ""))
ls()
str(samples_info)
'data.frame': 264 obs. of 8 variables: $ array.labels: chr "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ... $ PedID : chr "1" "1" "1" "1" ... $ ID : chr "1" "1" "1" "1" ... $ Status : chr "2" "2" "2" "2" ... $ Stim : chr "0" "0" "6" "6" ... $ Full : chr "1_1_2_0" "1_1_2_0" "1_1_2_6" "1_1_2_6" ... $ Sex : chr "1" "1" "1" "1" ... $ Age : int 10 10 10 10 10 10 18 18 18 18 ...
We load the norm.quant object obtained after log2 transformation and between-samples quantile normalization.
Each row corresponds to an Illumina probeID, while each column name corresponds to an Illumina array sample.
The correspondance between probeID and gene names is provided in the probes object, while the experimental design matrix of samples is provided in the ̀sample_info object
# Code cell n°8
load(paste(datapath,"norm.quant.RData", sep = ""))
ls()
str(norm.quant)
norm.quant[1:10,1:10]
num [1:47323, 1:264] 7.35 7.09 6.3 6.57 6.49 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:47323] "6450255" "2570615" "6370619" "2600039" ... ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
| 5753669129_B | 5753669129_C | 5753669129_E | 5753669129_K | 5753669129_A | 5753669129_F | 5753669095_B | 5753669095_F | 5753669095_G | 5753669095_K | |
|---|---|---|---|---|---|---|---|---|---|---|
| 6450255 | 7.351549 | 7.056063 | 7.052947 | 7.398073 | 7.223727 | 6.866752 | 7.019286 | 6.986984 | 7.289715 | 7.082987 |
| 2570615 | 7.087770 | 6.909035 | 6.874572 | 6.794254 | 6.635361 | 6.766837 | 6.650753 | 6.841188 | 6.816850 | 6.684924 |
| 6370619 | 6.299524 | 6.754202 | 6.550678 | 6.574270 | 6.625025 | 6.473627 | 6.401915 | 6.318659 | 6.461961 | 6.427141 |
| 2600039 | 6.572519 | 6.609202 | 6.568940 | 6.391801 | 6.435965 | 6.782794 | 6.468696 | 6.537351 | 6.494413 | 6.615691 |
| 2650615 | 6.492314 | 6.494968 | 6.773884 | 6.507731 | 6.935357 | 6.428773 | 6.606865 | 6.618028 | 6.434543 | 6.448600 |
| 5340672 | 6.194034 | 6.138695 | 6.069135 | 6.192778 | 6.247462 | 6.235562 | 5.909282 | 6.065657 | 6.056266 | 6.199995 |
| 2000519 | 6.717292 | 6.453334 | 6.883398 | 6.958528 | 6.913643 | 6.357944 | 6.649513 | 6.774690 | 6.485854 | 6.731937 |
| 3870044 | 5.894237 | 5.753032 | 5.981951 | 5.991817 | 5.744194 | 5.728102 | 5.842047 | 5.699204 | 5.672224 | 5.871414 |
| 7050209 | 6.733664 | 6.699775 | 6.569374 | 6.527873 | 6.850950 | 6.769487 | 6.569142 | 6.539663 | 6.786768 | 6.549867 |
| 1580181 | 6.740298 | 6.402660 | 6.329971 | 6.402146 | 6.536189 | 6.628187 | 6.624250 | 6.416259 | 6.415431 | 6.628095 |
At the end of the DGE analysis, we stored all results in a list limma.full.outs with 6 dataframes:
1. with the result of the full model analysis: testing whether gene expression varies in at least one condition
2. with the result of the first contrast for the DGE analysis
3. etc...
We saved this list in an excel document DGE_T1D_microarrays.xlsx, each tab corresponding to one list element.
read_xlsxfunction.# Code cell n°9
limma.full.outs <- list()
sheet_names <- readxl::excel_sheets(paste(datapath,"DGE_T1D_microarrays.xlsx",sep=""))
for (sheet in sheet_names) {
limma.full.outs[[sheet]] <- as.data.frame(readxl::read_xlsx( paste(datapath,
"DGE_T1D_microarrays.xlsx",
sep = ""),
sheet = sheet))
}
This list contains the following 6 dataframes:
# Code cell n°10
print(names(limma.full.outs))
[1] "limma.fullmodel.out" "limma.Pat.H0.out" "limma.Pat.H24.out" [4] "limma.Pat.H6.out" "limma.Pat.H624.out" "limma.Pat.out"
For this session, we will focus on the 6th dataframe with the results of the DGE analysis comparing Patients and Controls whatever the stimulation:
=> Enter the code in cell code 11 to assign this dataframe into a `DE` object.
# Code cell n°11
DE <- limma.full.outs[[6]]
str(DE)
'data.frame': 47323 obs. of 17 variables: $ ProbeID : chr "3120474" "3310376" "630725" "6280440" ... $ logFC : num -1.583 0.915 -2.133 1.175 0.665 ... $ CI.L : num -1.979 0.657 -2.797 0.799 0.449 ... $ CI.R : num -1.186 1.174 -1.47 1.551 0.882 ... $ AveExpr : num 7.81 11.45 7.92 7.27 7.86 ... $ t : num -7.86 6.97 -6.33 6.15 6.04 ... $ P.Value : num 9.51e-14 2.48e-11 1.02e-09 2.86e-09 5.02e-09 ... $ adj.P.Val : num 4.50e-09 5.86e-07 1.62e-05 3.39e-05 4.75e-05 ... $ B : num 20 14.9 11.5 10.6 10.1 ... $ CHROMOSOME : chr "22" "11" "12" "22" ... $ CYTOBAND : chr "22q11.23c" "11q12.1a" "12q15a" "22q13.2c" ... $ PROBE_CHR_ORIENTATION: chr "+" "-" "-" "+" ... $ PROBE_COORDINATES : chr "23923092-23923141" "57296016-57296065" "68548736-68548785" "41855291-41855340" ... $ REFSEQ_ID : chr "XM_371461.4" "NM_012456.2" "NM_000619.2" "NM_001197.3" ... $ ENTREZ_GENE_ID : num 85379 26519 3458 638 282969 ... $ TargetID : chr "KIAA1671" "TIMM10" "IFNG" "BIK" ... $ SYMBOL : chr "KIAA1671" "TIMM10" "IFNG" "BIK" ...
We have the log2 Fold Change between the two conditions (here log2(Patients) - log2(Controls)), with its confidence interval, followed by the average expression in both conditions, then the statistic value t, the P.Val and adj.P.Val after Benjamini-Hochberg correction and the B statistics (the one used for pvalue). Then you have probe annotations.
Since some genes have several probe sets, we will keep only the probeID with the smallest (minimal) pvalue.
To do so, we will use tidyverse tools like the pipe (%>% in one direction or %<>% to overwrite the input with the output) and several dplyr functions: group_by(), summarise(), arrange(), left_join(), rename() and select(). For select(), this function might be masked by other packages with a function with the same name. To avoid any issue, add dplyr:: before select to be sure you will use the right one.
# Code cell n°12
DE %<>%
group_by(TargetID) %>%
summarise(minimum = min(P.Value)) %>%
arrange (minimum) %>%
left_join(DE, by = c("TargetID" = "TargetID", "minimum" = "P.Value")) %>%
rename(P.Value = minimum) %>%
dplyr::select(TargetID, ProbeID:t, P.Value, everything())
=> Explain in details the above command.
group_by() per Gene/TargetIDsummarize()arrange()left_join() we merge the output with DE using a double key, i.e. TargetID and P.Value here: each unique combination of these two variables is considered as a key. We do this here because some genes may have the same pvalue(estimated duration: 15 minutes)
Gene co-expression correlations provide a robust methodology for predicting gene function, as genes sharing a biological process or a common implication in pathways are often co-regulated.
We want to see if the DE genes identified in the previous analysis have a correlated expression in a pair-wise manner, that is if two genes share a similar pattern of expression across samples.
A Pearson or Spearman correlation is performed between continuous variables with the cor() function, resulting in a correlation coefficient between each pair of genes. This correlation is displayed in a scatter plot with the function plot().
cor() function can also be applied on a matrix of data. To have nice displays, it is often better to work on a small subset of genes. In this tutorial/practical, we decide to arbitrary use the DE genes at an adjusted pvalue of 0.001 to reduce the number of genes.
=> Find the command in cell 13 to assign the DE genes in a probes_sig object.
# Code cell n°13
probes_sig <- subset(DE, adj.P.Val < 0.001 )$ProbeID
Then we retrieve the normalised expression values for these probes:
# Code cell n°14
norm_sig <- norm.quant[probes_sig,]
str(norm_sig)
num [1:27, 1:264] 7.34 11.93 6.66 7.41 7.54 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:27] "3120474" "3310376" "630725" "6280440" ... ..$ : chr [1:264] "5753669129_B" "5753669129_C" "5753669129_E" "5753669129_K" ...
and we add the annotations for samples and genes:
# Code cell n°15
colnames(norm_sig) <- samples_info$Full
row.names(norm_sig) <- DE$TargetID[1:nrow(norm_sig)]
norm_sig
| 1_1_2_0 | 1_1_2_0 | 1_1_2_6 | 1_1_2_6 | 1_1_2_24 | 1_1_2_24 | 1_2_1_0 | 1_2_1_0 | 1_2_1_6 | 1_2_1_6 | ⋯ | 16_44_2_6 | 16_44_2_6 | 16_44_2_24 | 16_44_2_24 | 16_45_1_0 | 16_45_1_0 | 16_45_1_6 | 16_45_1_6 | 16_45_1_24 | 16_45_1_24 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KIAA1671 | 7.343120 | 7.447613 | 7.216001 | 7.303913 | 7.326394 | 7.625109 | 8.352923 | 8.146731 | 8.108562 | 7.886871 | ⋯ | 7.192470 | 7.672861 | 7.514039 | 7.216330 | 8.853999 | 9.163023 | 8.308849 | 8.086358 | 8.817482 | 8.578759 |
| TIMM10 | 11.930091 | 11.528976 | 11.809734 | 12.001712 | 11.926766 | 11.787124 | 11.216955 | 11.052444 | 11.238711 | 11.019660 | ⋯ | 11.745586 | 12.143564 | 11.890620 | 11.705762 | 11.737390 | 11.831913 | 11.935951 | 11.935951 | 11.909019 | 11.818191 |
| IFNG | 6.657629 | 6.386879 | 7.152258 | 7.431597 | 7.039594 | 7.329202 | 6.878076 | 7.014067 | 8.163466 | 7.776781 | ⋯ | 7.081490 | 7.305481 | 8.092202 | 7.140051 | 8.180138 | 7.756707 | 9.203368 | 8.859501 | 9.212225 | 9.411588 |
| BIK | 7.409234 | 7.366837 | 7.791465 | 7.645996 | 7.784947 | 7.378133 | 6.781385 | 6.679732 | 7.126129 | 6.815981 | ⋯ | 7.780808 | 7.520157 | 7.676869 | 7.982766 | 6.991029 | 7.155203 | 7.146111 | 7.130790 | 7.082163 | 7.006517 |
| C10ORF125 | 7.538956 | 7.636997 | 7.960573 | 8.166198 | 8.086619 | 8.060599 | 7.547027 | 7.708688 | 8.078550 | 7.955620 | ⋯ | 8.165097 | 8.195478 | 8.568719 | 8.482785 | 7.810941 | 7.864739 | 8.023080 | 7.889736 | 8.164001 | 8.233983 |
| PCBP4 | 7.272471 | 7.363396 | 7.325846 | 7.340632 | 7.285869 | 7.289395 | 6.986729 | 7.314163 | 7.144670 | 7.218553 | ⋯ | 7.285711 | 7.473111 | 7.338159 | 7.415561 | 7.261692 | 7.420121 | 7.270186 | 7.068478 | 7.105724 | 7.011915 |
| TSPAN12 | 6.791838 | 6.988745 | 7.023299 | 7.265492 | 6.805628 | 6.735537 | 7.115555 | 7.296701 | 7.571201 | 7.649578 | ⋯ | 7.418026 | 7.366475 | 7.197385 | 7.050058 | 7.349810 | 7.448410 | 7.384789 | 7.844883 | 7.009110 | 7.148402 |
| ASB1 | 7.835089 | 7.808532 | 8.318611 | 8.018179 | 8.166783 | 8.181279 | 7.879742 | 7.992502 | 7.993019 | 8.146209 | ⋯ | 8.126215 | 8.481352 | 7.979627 | 7.996140 | 8.107187 | 8.124648 | 8.456424 | 8.242488 | 8.249409 | 8.120792 |
| C3ORF14 | 10.674075 | 10.513139 | 10.460329 | 10.648180 | 10.477747 | 10.517077 | 9.675615 | 9.702687 | 9.829567 | 9.589841 | ⋯ | 9.940013 | 9.762532 | 10.150245 | 9.960477 | 8.477789 | 8.276359 | 8.181577 | 8.003796 | 8.105008 | 8.330563 |
| RNASE6 | 7.891498 | 7.910376 | 7.854600 | 7.569282 | 8.099190 | 7.543098 | 7.661933 | 7.458424 | 7.245634 | 7.215838 | ⋯ | 7.707259 | 7.569733 | 7.648885 | 7.711060 | 7.242358 | 7.432355 | 7.065933 | 6.934506 | 7.088712 | 7.027634 |
| NME6 | 9.085447 | 8.888108 | 8.828382 | 8.713671 | 8.806615 | 8.630083 | 9.031067 | 9.082021 | 8.727940 | 8.697981 | ⋯ | 8.626413 | 8.942225 | 8.882572 | 8.773557 | 9.021111 | 9.052394 | 8.940664 | 8.896175 | 8.900882 | 8.926518 |
| SPACA3 | 6.446287 | 6.566525 | 6.603397 | 6.834649 | 6.964989 | 6.949504 | 6.240427 | 6.493403 | 6.516231 | 6.695443 | ⋯ | 6.893669 | 6.737023 | 7.208380 | 7.371742 | 8.070479 | 7.988796 | 8.354017 | 8.095944 | 9.369081 | 9.257145 |
| FBXL6 | 8.539947 | 8.819761 | 9.142435 | 8.929213 | 8.684563 | 8.659228 | 8.542050 | 8.614166 | 9.053404 | 9.064767 | ⋯ | 9.425600 | 9.386748 | 8.885820 | 9.224027 | 8.848285 | 9.205805 | 8.881224 | 9.215002 | 8.926149 | 8.862490 |
| CD79B | 9.907201 | 10.138199 | 9.965211 | 9.974225 | 9.214670 | 9.659481 | 10.184428 | 9.828730 | 9.713550 | 9.632895 | ⋯ | 10.430558 | 9.761215 | 10.295697 | 10.232311 | 9.819228 | 9.759964 | 9.491386 | 9.199914 | 9.332825 | 9.519570 |
| LOC642934 | 10.729839 | 10.257489 | 10.458185 | 10.493177 | 10.161385 | 10.068115 | 10.585755 | 10.658817 | 10.619108 | 10.568198 | ⋯ | 10.232767 | 10.203488 | 9.888455 | 10.033040 | 10.175672 | 10.270854 | 10.010336 | 9.840618 | 10.203948 | 9.818812 |
| MSH3 | 9.371876 | 9.312859 | 8.897862 | 9.164684 | 9.160580 | 9.041472 | 8.763834 | 8.928843 | 8.865185 | 8.760436 | ⋯ | 9.066445 | 8.891292 | 9.131001 | 9.103145 | 9.199544 | 9.022396 | 9.170367 | 9.037783 | 9.177208 | 9.112533 |
| CASP6 | 9.162004 | 9.212947 | 8.676480 | 8.899570 | 9.068155 | 8.862168 | 8.989100 | 9.344975 | 8.700772 | 8.412895 | ⋯ | 8.630386 | 8.724534 | 8.889400 | 8.714962 | 9.079358 | 9.128209 | 8.623338 | 8.717426 | 8.601114 | 8.938729 |
| HS.193406 | 8.721780 | 8.329708 | 7.557707 | 7.640953 | 7.729796 | 7.671054 | 8.320900 | 8.177352 | 7.778433 | 7.696916 | ⋯ | 7.331327 | 7.633647 | 7.384239 | 7.384964 | 8.222983 | 7.956420 | 7.592699 | 7.618116 | 7.795083 | 7.651342 |
| SH3PXD2A | 9.155893 | 8.946390 | 8.725757 | 8.564723 | 8.648760 | 8.635914 | 9.273123 | 9.267277 | 8.888787 | 8.744213 | ⋯ | 8.479863 | 8.632227 | 8.358260 | 8.691702 | 9.231458 | 9.296297 | 8.716477 | 8.843244 | 8.842254 | 8.825829 |
| STAT1 | 13.375814 | 13.580293 | 13.577504 | 13.258620 | 13.732336 | 13.882736 | 13.596645 | 13.563741 | 13.493375 | 13.786548 | ⋯ | 13.501741 | 13.596645 | 14.103464 | 13.804827 | 13.913144 | 13.544191 | 13.583307 | 13.391263 | 14.263308 | 14.177526 |
| FAM101B | 6.923534 | 7.029156 | 6.977115 | 7.299546 | 7.256118 | 7.063625 | 7.237992 | 7.412065 | 7.564514 | 7.424746 | ⋯ | 7.416691 | 7.210184 | 7.447194 | 7.234131 | 7.539998 | 7.428534 | 7.855840 | 7.794578 | 7.615165 | 7.662633 |
| LOC728453 | 14.198776 | 14.117630 | 14.075082 | 14.324924 | 14.082040 | 14.075082 | 14.163553 | 14.180867 | 14.085453 | 13.971812 | ⋯ | 14.255541 | 14.166816 | 14.362484 | 14.343618 | 14.405436 | 14.324924 | 14.245065 | 14.303088 | 14.217228 | 14.213588 |
| STARD13 | 7.909351 | 7.675303 | 7.345735 | 7.412446 | 6.876136 | 7.128194 | 8.601114 | 8.237297 | 8.420801 | 8.308849 | ⋯ | 7.985645 | 8.121320 | 7.859027 | 7.423386 | 8.314272 | 7.808532 | 7.808532 | 7.864739 | 7.631234 | 7.792674 |
| LAD1 | 9.838995 | 8.921661 | 9.401012 | 9.859190 | 8.851453 | 9.916026 | 9.624226 | 9.024438 | 9.614616 | 9.530317 | ⋯ | 8.141968 | 8.683920 | 8.241895 | 8.332227 | 11.626843 | 11.411502 | 11.989721 | 11.969141 | 11.659498 | 11.649056 |
| SESTD1 | 8.020904 | 7.815665 | 8.000565 | 7.934491 | 7.875002 | 8.125960 | 7.995893 | 7.884513 | 7.793893 | 7.734771 | ⋯ | 7.989067 | 7.673951 | 7.943851 | 7.974716 | 8.059783 | 8.024919 | 7.933985 | 7.689644 | 7.814232 | 7.999524 |
| UBD | 9.651761 | 9.301834 | 8.676808 | 9.264475 | 8.410616 | 9.032068 | 9.871305 | 9.503186 | 9.323416 | 8.670923 | ⋯ | 7.820166 | 8.802034 | 8.643236 | 7.472283 | 11.718585 | 10.755954 | 12.028912 | 10.658817 | 11.510679 | 12.088311 |
| ST3GAL6 | 7.233085 | 7.052837 | 7.493180 | 7.646658 | 7.163081 | 7.275957 | 7.202096 | 7.146111 | 7.777929 | 7.803175 | ⋯ | 7.010689 | 7.311506 | 6.990942 | 6.716431 | 6.601188 | 6.640908 | 7.368470 | 7.672178 | 7.034240 | 7.044435 |
In the above command, notice that we asked to display the tibble but only some columns are displayed.
heatmap functionWe will have first to transpose the matrix of data with the t() function to have each gene name as a variable.
# Code cell n°16
heatmap(cor(t(norm_sig)))
corrplot() function: https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html# Code cell n°17
options(repr.plot.width = 10, repr.plot.height = 10)
corrplot(cor(t(norm_sig)))
# method = "square" is the default
# hclust clusterises the genes, grouping together the most correlated ones
# rectangles and correlation coefficients are also added
corrplot(cor(t(norm_sig)),
method = "square",
order = 'hclust',
addrect = 4,
addCoef.col = 'black',
number.cex = 0.4)
=> Find the arguments of the corrplot function in cell 18 with the only the upper matrix displayed and using an ellipse representative of each pair-wise correlation.
# Code cell n°18
# by definition a correlation matrix is symmetrical.
# type = 'upper' will show only the upper half of the matrix
corrplot(cor(t(norm_sig)),
method = 'ellipse',
type = 'upper',
insig = 'blank')
This all-in-one function computes the correlation matrix with the correlation coefficients and pvalues of the correlation tests and draws the correlogram.
# Code cell n°19
source("http://www.sthda.com/upload/rquery_cormat.r")
options(width = 200)
rquery.cormat(t(norm_sig))
$r
FBXL6 LOC728453 FAM101B TIMM10 C10ORF125 MSH3 SESTD1 NME6 CASP6 RNASE6 CD79B C3ORF14 BIK PCBP4 LOC642934 STARD13 KIAA1671 HS.193406 SH3PXD2A STAT1 IFNG UBD SPACA3 LAD1
FBXL6 1
LOC728453 0.36 1
FAM101B 0.3 0.15 1
TIMM10 0.3 0.26 0.3 1
C10ORF125 0.33 0.29 0.36 0.38 1
MSH3 -0.061 0.099 0.049 0.2 -0.04 1
SESTD1 0.11 0.28 -0.14 0.042 0.2 0.23 1
NME6 0.16 0.44 0.19 0.25 0.26 0.4 0.31 1
CASP6 -0.033 0.39 0.066 0.16 0.12 0.37 0.34 0.62 1
RNASE6 -0.11 0.17 0.018 0.044 -0.12 0.24 0.23 0.39 0.49 1
CD79B 0.16 0.34 -0.062 -0.091 -0.14 0.2 0.4 0.41 0.46 0.49 1
C3ORF14 -0.052 0.042 -0.14 0.14 0.2 0.19 0.16 0.23 0.31 0.18 0.048 1
BIK 0.29 0.076 -0.031 0.2 0.2 0.078 0.15 -0.0023 -0.028 0.068 0.013 0.25 1
PCBP4 0.36 0.3 -0.047 -0.0056 0.16 0.064 0.34 0.18 0.3 0.11 0.4 0.17 0.41 1
LOC642934 -0.2 -0.05 -0.2 -0.14 -0.4 0.0056 -0.039 0.093 -0.023 0.048 0.097 0.044 -0.12 -0.2 1
STARD13 0.07 0.16 -0.23 -0.2 -0.32 -0.15 -0.073 -0.039 -0.0072 -0.075 0.21 -0.17 -0.22 0.025 0.16 1
KIAA1671 -0.22 -0.17 -0.27 -0.22 -0.34 -0.041 -0.17 -0.26 -0.24 -0.25 -0.18 -0.29 -0.23 -0.2 0.2 0.42 1
HS.193406 -0.31 -0.16 -0.34 -0.22 -0.66 0.16 -0.062 0.022 0.15 0.22 0.23 -0.047 -0.29 -0.074 0.34 0.46 0.52 1
SH3PXD2A -0.21 -0.066 -0.47 -0.36 -0.46 -0.0031 0.047 -0.13 0.075 0.085 0.23 -0.051 -0.15 0.065 0.31 0.42 0.43 0.61 1
STAT1 -0.34 -0.33 -0.25 -0.13 -0.011 -0.059 -0.045 -0.28 -0.38 -0.2 -0.3 0.093 -0.0016 -0.31 0.12 -0.074 0.26 -0.021 0.076 1
IFNG -0.23 -0.43 0.075 -0.081 0.037 -0.12 -0.35 -0.41 -0.52 -0.42 -0.52 -0.14 -0.084 -0.37 0.016 -0.12 0.4 -0.061 -0.11 0.6 1
UBD -0.26 -0.25 -0.034 -0.15 -0.26 0.017 -0.23 -0.18 -0.29 -0.26 -0.19 -0.18 -0.3 -0.32 0.21 0.19 0.46 0.33 0.23 0.49 0.73 1
SPACA3 -0.097 -0.045 -0.076 0.11 0.18 -0.091 -0.2 -0.064 -0.29 -0.29 -0.44 -0.19 -0.22 -0.26 -0.083 -0.093 0.33 -0.1 -0.16 0.34 0.35 0.22 1
LAD1 -0.005 -0.05 -0.097 -0.11 -0.019 -0.047 -0.034 -0.19 -0.24 -0.45 -0.34 -0.35 -0.014 0.017 0.088 0.048 0.43 0.029 0.22 0.17 0.27 0.24 0.38 1
TSPAN12 -0.074 -0.094 -0.19 -0.32 -0.19 0.027 0.036 -0.11 -0.067 -0.35 0.09 -0.14 -0.28 -0.017 0.075 0.2 0.22 0.11 0.12 -0.06 3e-04 0.042 0.062 0.27
ASB1 -0.26 -0.46 -0.14 -0.19 -0.2 -0.28 -0.39 -0.51 -0.51 -0.24 -0.46 -0.21 -0.13 -0.34 0.011 -0.0035 0.18 -0.012 0.14 0.21 0.41 0.22 0.12 0.2
ST3GAL6 -0.13 -0.26 0.14 -0.12 -0.13 -0.14 -0.25 -0.38 -0.43 -0.23 -0.35 -0.38 -0.25 -0.38 0.0044 0.15 0.16 -0.033 -0.086 0.072 0.28 0.15 0.12 0.16
TSPAN12 ASB1 ST3GAL6
FBXL6
LOC728453
FAM101B
TIMM10
C10ORF125
MSH3
SESTD1
NME6
CASP6
RNASE6
CD79B
C3ORF14
BIK
PCBP4
LOC642934
STARD13
KIAA1671
HS.193406
SH3PXD2A
STAT1
IFNG
UBD
SPACA3
LAD1
TSPAN12 1
ASB1 0.097 1
ST3GAL6 0.17 0.43 1
$p
FBXL6 LOC728453 FAM101B TIMM10 C10ORF125 MSH3 SESTD1 NME6 CASP6 RNASE6 CD79B C3ORF14 BIK PCBP4 LOC642934 STARD13 KIAA1671 HS.193406 SH3PXD2A STAT1 IFNG UBD
FBXL6 0
LOC728453 1.2e-09 0
FAM101B 5.8e-07 0.013 0
TIMM10 8.1e-07 1.9e-05 7.7e-07 0
C10ORF125 6.4e-08 1.1e-06 2.7e-09 1e-10 0
MSH3 0.32 0.11 0.43 0.0015 0.52 0
SESTD1 0.081 4.9e-06 0.019 0.5 0.001 0.00012 0
NME6 0.0074 1e-13 0.0019 3.8e-05 2.6e-05 1.6e-11 3.1e-07 0
CASP6 0.59 5.9e-11 0.29 0.011 0.055 7e-10 1.7e-08 1.8e-29 0
RNASE6 0.081 0.0064 0.77 0.48 0.056 0.00011 0.00021 6.7e-11 3.5e-17 0
CD79B 0.0096 1.6e-08 0.32 0.14 0.023 0.0012 2.2e-11 5.1e-12 2.8e-15 1.5e-17 0
C3ORF14 0.4 0.5 0.021 0.019 0.00083 0.0015 0.0084 0.00013 3.3e-07 0.004 0.44 0
BIK 1.8e-06 0.22 0.62 0.00088 0.0011 0.2 0.015 0.97 0.65 0.27 0.83 4.8e-05 0
PCBP4 1.3e-09 8.9e-07 0.44 0.93 0.01 0.3 2.3e-08 0.0027 5.2e-07 0.086 2.2e-11 0.0067 6e-12 0
LOC642934 0.0011 0.42 0.00082 0.024 1.3e-11 0.93 0.53 0.13 0.71 0.44 0.11 0.48 0.062 0.00098 0
STARD13 0.26 0.0078 0.00021 0.00091 1.7e-07 0.013 0.24 0.53 0.91 0.22 8e-04 0.0046 0.00025 0.68 0.011 0
KIAA1671 0.00027 0.0048 1.1e-05 0.00035 1.7e-08 0.51 0.006 1.5e-05 1e-04 2.8e-05 0.0027 1.2e-06 0.00015 0.0011 0.0012 5.5e-13 0
HS.193406 1.8e-07 0.0099 2.2e-08 0.00026 4.7e-34 0.012 0.32 0.72 0.015 0.00025 0.00018 0.45 1.1e-06 0.23 1.1e-08 2.7e-15 6e-20 0
SH3PXD2A 0.00078 0.29 3.4e-16 9.8e-10 5.3e-15 0.96 0.45 0.036 0.22 0.17 0.00015 0.41 0.013 0.29 3.2e-07 8.5e-13 2.8e-13 3e-28 0
STAT1 1.4e-08 5e-08 3.7e-05 0.033 0.86 0.34 0.47 5.1e-06 1.6e-10 0.0014 1e-06 0.13 0.98 1.9e-07 0.059 0.23 1.7e-05 0.73 0.22 0
IFNG 0.00015 1.3e-13 0.23 0.19 0.54 0.059 4.6e-09 2.6e-12 7.6e-20 5.5e-13 4.7e-20 0.027 0.17 9e-10 0.79 0.051 2e-11 0.32 0.074 6.5e-27 0
UBD 2.3e-05 4e-05 0.58 0.018 1.5e-05 0.78 0.00013 0.0043 2.1e-06 2.6e-05 0.0023 0.0037 6.6e-07 7.3e-08 0.00053 0.002 3.6e-15 2.8e-08 0.00014 1.5e-17 7.9e-46 0
SPACA3 0.12 0.47 0.22 0.069 0.0034 0.14 0.0012 0.3 1.2e-06 1.1e-06 1.1e-13 0.0025 0.00034 1.9e-05 0.18 0.13 3.1e-08 0.094 0.0078 8.9e-09 4.6e-09 0.00023
LAD1 0.94 0.41 0.12 0.079 0.76 0.44 0.58 0.0015 8.1e-05 2.9e-14 1.2e-08 5.1e-09 0.82 0.78 0.15 0.43 2.3e-13 0.63 0.00043 0.0049 1e-05 8.4e-05
TSPAN12 0.23 0.13 0.0024 1.3e-07 0.0019 0.66 0.56 0.078 0.28 7.1e-09 0.14 0.024 5.6e-06 0.78 0.22 0.0013 3e-04 0.077 0.056 0.33 1 0.5
ASB1 2.6e-05 3.6e-15 0.02 0.0016 0.0012 5e-06 6e-11 1.2e-18 8.2e-19 8e-05 5.5e-15 0.00059 0.033 1.6e-08 0.86 0.95 0.0034 0.84 0.023 0.00049 3.4e-12 0.00025
ST3GAL6 0.035 2.3e-05 0.026 0.056 0.04 0.022 2.8e-05 1.7e-10 1.4e-13 0.00015 4.2e-09 1.5e-10 3e-05 2.1e-10 0.94 0.014 0.0096 0.59 0.16 0.25 4.7e-06 0.012
SPACA3 LAD1 TSPAN12 ASB1 ST3GAL6
FBXL6
LOC728453
FAM101B
TIMM10
C10ORF125
MSH3
SESTD1
NME6
CASP6
RNASE6
CD79B
C3ORF14
BIK
PCBP4
LOC642934
STARD13
KIAA1671
HS.193406
SH3PXD2A
STAT1
IFNG
UBD
SPACA3 0
LAD1 2.6e-10 0
TSPAN12 0.32 1.2e-05 0
ASB1 0.046 9e-04 0.12 0
ST3GAL6 0.048 0.011 0.0047 2.7e-13 0
$sym
FBXL6 LOC728453 FAM101B TIMM10 C10ORF125 MSH3 SESTD1 NME6 CASP6 RNASE6 CD79B C3ORF14 BIK PCBP4 LOC642934 STARD13 KIAA1671 HS.193406 SH3PXD2A STAT1 IFNG UBD SPACA3 LAD1 TSPAN12 ASB1 ST3GAL6
FBXL6 1
LOC728453 . 1
FAM101B 1
TIMM10 1
C10ORF125 . . . 1
MSH3 1
SESTD1 1
NME6 . . . 1
CASP6 . . . , 1
RNASE6 . . 1
CD79B . . . . . 1
C3ORF14 . 1
BIK 1
PCBP4 . . . . 1
LOC642934 . 1
STARD13 . 1
KIAA1671 . . 1
HS.193406 . . , . . . 1
SH3PXD2A . . . . . . , 1
STAT1 . . . . 1
IFNG . . . . . . . . . 1
UBD . . . . , 1
SPACA3 . . . . 1
LAD1 . . . . . 1
TSPAN12 . . 1
ASB1 . . . . . . . 1
ST3GAL6 . . . . . . 1
attr(,"legend")
[1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1
We saw this package in the first session of this option.
=> Find the commands to display the correlogram with the -log10 pvalues dispayed as an annotation using the ComplexHeatmap package.
# Code cell n°20
# by definition a correlation matrix is symmetrical.
# type = 'upper' will show only the upper half of the matrix
corrplot(cor(t(norm_sig)),
method = 'ellipse',
type = 'upper',
insig = 'blank')
See nice tuto: http://www.sthda.com/english/wiki/ggcorrplot-visualization-of-a-correlation-matrix-using-ggplot2 where you can color only the significant ones.
# Code cell n°21
ggcorrplot(cor(t(norm_sig)),
p.mat = cor_pmat(t(norm_sig)) ,
hc.order = TRUE,
type = "lower",
insig = "blank",
lab = TRUE,
lab_size = 3)
# Code cell n°22
# Pearson correlation coefficients, using pairwise observations (default method)
GGally::ggcorr(t(norm_sig),
method = c("pairwise", "pearson"),
label = TRUE,
label_size = 3)
# Code cell n°23
GGally::ggcorr(t(norm_sig),
method = c("pairwise", "pearson"),
geom = "circle",
nbreaks = 5,
min_size = 0,
max_size = 7)
# Code cell n°24
GGally::ggpairs(as.data.frame(t(norm_sig)),
columns = 1:5,
ggplot2::aes(colour = samples_info$Status, alpha = 0.5))
# Code cell n°25
GGally::ggpairs(as.data.frame(t(norm_sig)),
columns = 1:5,
ggplot2::aes(colour = samples_info$Stim, alpha = 0.5))
=> Explain the command of the above two code cells n°24 and 25
The ppcor package computes partial correlations.
# Code cell n°26
res <- pcor(t(norm_sig))
str(res)
colnames(res[[1]]) <- row.names(norm_sig)
row.names(res[[1]]) <- row.names(norm_sig)
ggcorrplot(res[[1]],
p.mat = res[[2]] ,
hc.order = TRUE,
type = "lower",
insig = "blank",
lab=TRUE,
lab_size = 3)
Warning message in pcor(t(norm_sig)): “The inverse of variance-covariance matrix is calculated using Moore-Penrose generalized matrix invers due to its determinant of zero.”
List of 6 $ estimate : num [1:27, 1:27] 1 -0.00654 0.42944 0.04727 -0.04638 ... $ p.value : num [1:27, 1:27] 0.00 9.20e-01 3.83e-12 4.67e-01 4.75e-01 ... $ statistic: num [1:27, 1:27] 0 -0.101 7.32 0.729 -0.715 ... $ n : int 264 $ gp : num 25 $ method : chr "pearson"
(estimated duration: 60 minutes)
We will focus on the Transcription Factor STAT1 whose gene is significantly downregulated in patients compared to controls.
We will use data from "Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing" from Roberston G. et al. Nature Methods 2007 PMID: 17558387; DOI: 10.1038/nmeth1068. It was amnong the very first papers analysing ChIPSeq data. In this dataset, they analysed Hela HeLa S3 cells that were stimulated or not with gamma-interferon.
The data are available on the GEO database with the ID: GSE15353. We will get the 4 supplemnatry files that are:
| File | link | Size | Unit |
|---|---|---|---|
| GSE15353_STAT1_hg18_IFNg_ht11.peaks.txt | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15353/suppl/GSE15353%5FSTAT1%5Fhg18%5FIFNg%5Fht11%2Epeaks%2Etxt | 1.3 | Mb |
| GSE15353_STAT1_hg18_IFNg_ht11.wig | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15353/suppl/GSE15353%5FSTAT1%5Fhg18%5FIFNg%5Fht11%2Ewig | 96.2 | Mb |
| GSE15353_STAT1_hg18_Unstimulated_ht11.peaks.txt | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15353/suppl/GSE15353%5FSTAT1%5Fhg18%5FUnstimulated%5Fht11%2Epeaks%2Etxt | 342.6 | Kb |
| GSE15353_STAT1_hg18_Unstimulated_ht11.wig | ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE15nnn/GSE15353/suppl/GSE15353%5FSTAT1%5Fhg18%5FUnstimulated%5Fht11%2Ewig | 21.6 | Mb |
=> Open a terminal and download each one with the wget command followed by the link: bash commands below in your working directory.
Have a look at the first rows of the wig .txt files with less -SN filename. They contain the STAT1 peaks with the height of the peaks and their genomic coordinates. For example, you can enter the following command in the terminal (use q to quit).
We import the text files in R.
# Code cell n°27
STAT1_unstim_peaks_hg18 <- read.table("GSE15353_STAT1_hg18_Unstimulated_ht11.peaks.txt", header=T)
str(STAT1_unstim_peaks_hg18)
STAT1_ifng_peaks_hg18 <- read.table("GSE15353_STAT1_hg18_IFNg_ht11.peaks.txt", header=T)
str(STAT1_ifng_peaks_hg18)
'data.frame': 11004 obs. of 6 variables: $ chr : chr "10" "10" "10" "10" ... $ height : int 14 11 12 11 12 11 13 14 11 1935 ... $ start : int 967275 2010523 2496490 3075739 3122866 3857309 4376424 4736305 4881845 5461184 ... $ end : int 967866 2011422 2497023 3076124 3123854 3857940 4377362 4736905 4882581 5461374 ... $ numtags : int 20 28 23 13 25 22 26 22 22 1936 ... $ max_ht_offset: int 379 374 214 130 394 192 349 226 150 43 ... 'data.frame': 41582 obs. of 6 variables: $ chr : chr "10" "10" "10" "10" ... $ height : int 46 30 12 13 14 12 11 30 31 29 ... $ start : int 142456 302744 389895 729167 1023827 1083937 1092125 1301013 1390001 1986088 ... $ end : int 143253 303395 390540 729993 1024778 1085837 1093202 1301397 1392080 1987012 ... $ numtags : int 74 44 23 27 28 70 30 36 94 76 ... $ max_ht_offset: int 414 405 338 371 539 1186 581 171 1716 494 ...
We will add an ID to each peak.
# Code cell n°28
STAT1_unstim_peaks_hg18$ID <- paste0("unstim_", 1:nrow(STAT1_unstim_peaks_hg18))
str(STAT1_unstim_peaks_hg18)
STAT1_ifng_peaks_hg18$ID <- paste0("ifng_", 1:nrow(STAT1_ifng_peaks_hg18))
str(STAT1_ifng_peaks_hg18)
'data.frame': 11004 obs. of 7 variables: $ chr : chr "10" "10" "10" "10" ... $ height : int 14 11 12 11 12 11 13 14 11 1935 ... $ start : int 967275 2010523 2496490 3075739 3122866 3857309 4376424 4736305 4881845 5461184 ... $ end : int 967866 2011422 2497023 3076124 3123854 3857940 4377362 4736905 4882581 5461374 ... $ numtags : int 20 28 23 13 25 22 26 22 22 1936 ... $ max_ht_offset: int 379 374 214 130 394 192 349 226 150 43 ... $ ID : chr "unstim_1" "unstim_2" "unstim_3" "unstim_4" ... 'data.frame': 41582 obs. of 7 variables: $ chr : chr "10" "10" "10" "10" ... $ height : int 46 30 12 13 14 12 11 30 31 29 ... $ start : int 142456 302744 389895 729167 1023827 1083937 1092125 1301013 1390001 1986088 ... $ end : int 143253 303395 390540 729993 1024778 1085837 1093202 1301397 1392080 1987012 ... $ numtags : int 74 44 23 27 28 70 30 36 94 76 ... $ max_ht_offset: int 414 405 338 371 539 1186 581 171 1716 494 ... $ ID : chr "ifng_1" "ifng_2" "ifng_3" "ifng_4" ...
Caution: the coordinates are on hg18
The most popular method to do the conversion on hg38 is using the UCSC liftOver tool.
To do so, you need to provide .BED files. A .BED file has 3 mandatory columns: the chromosome, the start and the end positions of the feature (see documentation here: https://genome.ucsc.edu/FAQ/FAQformat.html#format1)
Below are the commands to generate .BED corresponding to the `.txt̀ files.
# Code cell n°29
temp <- STAT1_unstim_peaks_hg18
temp$chr <- paste0("chr",temp$chr)
write.table(temp[,c(1,3:4,7,2,5:6)], file = "STAT1_unstim_peaks_hg18.bed", quote=F, sep="\t", row.names = FALSE, col.names = FALSE)
temp <- STAT1_ifng_peaks_hg18
temp$chr <- paste0("chr",temp$chr)
write.table(temp[,c(1,3:4,7,2,5:6)], file = "STAT1_ifng_peaks_hg18.bed", quote=F, sep="\t", row.names = FALSE, col.names = FALSE)
rm(temp)
# Code cell n°30
STAT1_unstim_peaks_hg38 <-read.table("STAT1_unstim_peaks_hg38.bed", sep="\t", header=F)
names(STAT1_unstim_peaks_hg38) <- names(STAT1_unstim_peaks_hg18)[c(1,3:4,7,2,5:6)]
str(STAT1_unstim_peaks_hg38)
'data.frame': 10969 obs. of 7 variables: $ chr : chr "chr10" "chr10" "chr10" "chr10" ... $ start : int 931335 1978329 2464298 3043547 3090674 3825117 4344232 4704113 4849653 5429221 ... $ end : int 931926 1979228 2464831 3043932 3091662 3825748 4345170 4704713 4850389 5429411 ... $ ID : chr "unstim_1" "unstim_2" "unstim_3" "unstim_4" ... $ height : int 14 11 12 11 12 11 13 14 11 1935 ... $ numtags : int 2 2 2 1 2 2 2 2 2 1 ... $ max_ht_offset: int 379 374 214 130 394 192 349 226 150 43 ...
# Code cell n°31
STAT1_ifng_peaks_hg38 <-read.table("STAT1_ifng_peaks_hg38.bed", sep="\t", header=F)
names(STAT1_ifng_peaks_hg38) <- names(STAT1_ifng_peaks_hg18)[c(1,3:4,7,2,5:6)]
str(STAT1_ifng_peaks_hg38)
'data.frame': 41536 obs. of 7 variables: $ chr : chr "chr10" "chr10" "chr10" "chr10" ... $ start : int 106516 266804 353955 693227 987887 1047997 1056185 1268818 1357806 1953894 ... $ end : int 107313 267455 354600 694053 988838 1049897 1057262 1269202 1359885 1954818 ... $ ID : chr "ifng_1" "ifng_2" "ifng_3" "ifng_4" ... $ height : int 46 30 12 13 14 12 11 30 31 29 ... $ numtags : int 7 4 2 2 2 7 3 3 9 7 ... $ max_ht_offset: int 414 405 338 371 539 1186 581 171 1716 494 ...
rtracklayer package¶The liftOver tool is implemented in the rtracklayer Bioconductor package (https://bioconductor.org/packages/release/bioc/html/rtracklayer.html.)
This package is dedicated to handle and generate tracks for the UCSC genome browser, or to query its genome database.
UCSC stored conversion chains between genome assemblies. On the UCSC download web page (http://hgdownload.soe.ucsc.edu/downloads.html), you need to find the link corresponding to the starting assembly and the chain file for you conversion of interest. Here we find the conversion chains strating from hg18 at this link: http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/.
We are lucky the chain to hg38 is avalaible: http://hgdownload.soe.ucsc.edu/goldenPath/hg18/liftOver/hg18ToHg38.over.chain.gz
However, if you want to convert to an assembly which is not listed here, for example the T2T genome, you have to do a conversion to an intermediate assembly.
=> Open a terminal and download this chain gunzipped file with the wget command followed by the link: bash commands below in your working directory.
Then unzip it prior to importing it to R with rtraklayer.
# Code cell n°32
chain <- rtracklayer::import.chain("./hg18ToHg38.over.chain")
names(chain)
As you can notice, chromosome names have the "chr" prefix, while our STAT1 peaks objects have not. We thus homogeneize chromosome names in our peak dataframes by adding "chr.
# Code cell n°33
STAT1_unstim_peaks_hg18$chr <- paste0("chr", STAT1_unstim_peaks_hg18$chr)
STAT1_ifng_peaks_hg18$chr <- paste0("chr", STAT1_ifng_peaks_hg18$chr)
To use the liftOver function of rtracklayer, the imput file we want to convert has to be a genomic object of class GRanges from packageGenomicRanges, as we have seen them in R session 5 in the R5_tuto_packages.ipynb.
The makeGRangesFromDataFramefunction generate these GR objects.
# Code cell n°34
STAT1_unstim_peaks_hg18GR <- makeGRangesFromDataFrame(STAT1_unstim_peaks_hg18, keep.extra.columns = TRUE)
str(STAT1_unstim_peaks_hg18GR)
Formal class 'GRanges' [package "GenomicRanges"] with 7 slots ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 24 levels "chr1","chr2",..: 10 11 12 13 14 15 16 17 18 19 ... .. .. ..@ lengths : int [1:24] 535 450 653 159 221 388 579 602 104 416 ... .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots .. .. ..@ start : int [1:11004] 967275 2010523 2496490 3075739 3122866 3857309 4376424 4736305 4881845 5461184 ... .. .. ..@ width : int [1:11004] 592 900 534 386 989 632 939 601 737 191 ... .. .. ..@ NAMES : NULL .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 3 levels "+","-","*": 3 .. .. ..@ lengths : int 11004 .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots .. .. ..@ seqnames : chr [1:24] "chr1" "chr2" "chr3" "chr4" ... .. .. ..@ seqlengths : int [1:24] NA NA NA NA NA NA NA NA NA NA ... .. .. ..@ is_circular: logi [1:24] NA NA NA NA NA NA ... .. .. ..@ genome : chr [1:24] NA NA NA NA ... ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots .. .. ..@ rownames : NULL .. .. ..@ nrows : int 11004 .. .. ..@ listData :List of 4 .. .. .. ..$ height : int [1:11004] 14 11 12 11 12 11 13 14 11 1935 ... .. .. .. ..$ numtags : int [1:11004] 20 28 23 13 25 22 26 22 22 1936 ... .. .. .. ..$ max_ht_offset: int [1:11004] 379 374 214 130 394 192 349 226 150 43 ... .. .. .. ..$ ID : chr [1:11004] "unstim_1" "unstim_2" "unstim_3" "unstim_4" ... .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ elementType : chr "ANY" ..@ metadata : list()
# Code cell n°35
head(STAT1_unstim_peaks_hg18GR)
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | height numtags max_ht_offset ID
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <character>
[1] chr10 967275-967866 * | 14 20 379 unstim_1
[2] chr10 2010523-2011422 * | 11 28 374 unstim_2
[3] chr10 2496490-2497023 * | 12 23 214 unstim_3
[4] chr10 3075739-3076124 * | 11 13 130 unstim_4
[5] chr10 3122866-3123854 * | 12 25 394 unstim_5
[6] chr10 3857309-3857940 * | 11 22 192 unstim_6
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
# Code cell n°36
STAT1_ifng_peaks_hg18GR <- makeGRangesFromDataFrame(STAT1_ifng_peaks_hg18, keep.extra.columns = TRUE)
str(STAT1_ifng_peaks_hg18GR)
Formal class 'GRanges' [package "GenomicRanges"] with 7 slots ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 24 levels "chr1","chr2",..: 10 11 12 13 14 15 16 17 18 19 ... .. .. ..@ lengths : int [1:24] 1981 1808 2280 850 1062 1435 1572 2004 725 1346 ... .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots .. .. ..@ start : int [1:41582] 142456 302744 389895 729167 1023827 1083937 1092125 1301013 1390001 1986088 ... .. .. ..@ width : int [1:41582] 798 652 646 827 952 1901 1078 385 2080 925 ... .. .. ..@ NAMES : NULL .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 3 levels "+","-","*": 3 .. .. ..@ lengths : int 41582 .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots .. .. ..@ seqnames : chr [1:24] "chr1" "chr2" "chr3" "chr4" ... .. .. ..@ seqlengths : int [1:24] NA NA NA NA NA NA NA NA NA NA ... .. .. ..@ is_circular: logi [1:24] NA NA NA NA NA NA ... .. .. ..@ genome : chr [1:24] NA NA NA NA ... ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots .. .. ..@ rownames : NULL .. .. ..@ nrows : int 41582 .. .. ..@ listData :List of 4 .. .. .. ..$ height : int [1:41582] 46 30 12 13 14 12 11 30 31 29 ... .. .. .. ..$ numtags : int [1:41582] 74 44 23 27 28 70 30 36 94 76 ... .. .. .. ..$ max_ht_offset: int [1:41582] 414 405 338 371 539 1186 581 171 1716 494 ... .. .. .. ..$ ID : chr [1:41582] "ifng_1" "ifng_2" "ifng_3" "ifng_4" ... .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ elementType : chr "ANY" ..@ metadata : list()
# Code cell n°37
head(STAT1_ifng_peaks_hg18GR)
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | height numtags max_ht_offset ID
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <character>
[1] chr10 142456-143253 * | 46 74 414 ifng_1
[2] chr10 302744-303395 * | 30 44 405 ifng_2
[3] chr10 389895-390540 * | 12 23 338 ifng_3
[4] chr10 729167-729993 * | 13 27 371 ifng_4
[5] chr10 1023827-1024778 * | 14 28 539 ifng_5
[6] chr10 1083937-1085837 * | 12 70 1186 ifng_6
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
rtraclayer¶Now we can convert the coordinates with the liftover function.
# Code cell n°38
STAT1_unstim_peaks_hg38GR <- unlist(rtracklayer::liftOver(STAT1_unstim_peaks_hg18GR, chain))
head(STAT1_unstim_peaks_hg38GR)
STAT1_ifng_peaks_hg38GR <- unlist(rtracklayer::liftOver(STAT1_ifng_peaks_hg18GR, chain))
head(STAT1_ifng_peaks_hg38GR)
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | height numtags max_ht_offset ID
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <character>
[1] chr10 931335-931926 * | 14 20 379 unstim_1
[2] chr10 1978329-1979228 * | 11 28 374 unstim_2
[3] chr10 2464298-2464831 * | 12 23 214 unstim_3
[4] chr10 3043547-3043932 * | 11 13 130 unstim_4
[5] chr10 3090674-3091662 * | 12 25 394 unstim_5
[6] chr10 3825117-3825748 * | 11 22 192 unstim_6
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | height numtags max_ht_offset ID
<Rle> <IRanges> <Rle> | <integer> <integer> <integer> <character>
[1] chr10 106516-107313 * | 46 74 414 ifng_1
[2] chr10 266804-267455 * | 30 44 405 ifng_2
[3] chr10 353955-354600 * | 12 23 338 ifng_3
[4] chr10 693227-694053 * | 13 27 371 ifng_4
[5] chr10 987887-988838 * | 14 28 539 ifng_5
[6] chr10 1047997-1049897 * | 12 70 1186 ifng_6
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
If you did the conversion with the UCSC tool online, you should find the same coordinates, as you can check for the first 6 peaks.
# Code cell n°39
head(STAT1_unstim_peaks_hg38)
head(STAT1_ifng_peaks_hg38)
| chr | start | end | ID | height | numtags | max_ht_offset | |
|---|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <chr> | <int> | <int> | <int> | |
| 1 | chr10 | 931335 | 931926 | unstim_1 | 14 | 2 | 379 |
| 2 | chr10 | 1978329 | 1979228 | unstim_2 | 11 | 2 | 374 |
| 3 | chr10 | 2464298 | 2464831 | unstim_3 | 12 | 2 | 214 |
| 4 | chr10 | 3043547 | 3043932 | unstim_4 | 11 | 1 | 130 |
| 5 | chr10 | 3090674 | 3091662 | unstim_5 | 12 | 2 | 394 |
| 6 | chr10 | 3825117 | 3825748 | unstim_6 | 11 | 2 | 192 |
| chr | start | end | ID | height | numtags | max_ht_offset | |
|---|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <chr> | <int> | <int> | <int> | |
| 1 | chr10 | 106516 | 107313 | ifng_1 | 46 | 7 | 414 |
| 2 | chr10 | 266804 | 267455 | ifng_2 | 30 | 4 | 405 |
| 3 | chr10 | 353955 | 354600 | ifng_3 | 12 | 2 | 338 |
| 4 | chr10 | 693227 | 694053 | ifng_4 | 13 | 2 | 371 |
| 5 | chr10 | 987887 | 988838 | ifng_5 | 14 | 2 | 539 |
| 6 | chr10 | 1047997 | 1049897 | ifng_6 | 12 | 7 | 1186 |
If needed for further analyses, you can convert GRanges objects to dataframes using as.data.frame().
To do so, we could use the biomaRt package as we used it in R session 5 in the R5_tuto_packages.ipynb.
# Code cell n°40
listMarts(host = "www.ensembl.org")
ensembl <- useMart("ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org", dataset = "hsapiens_gene_ensembl")
Warning message: “Ensembl will soon enforce the use of https. Ensure the 'host' argument includes "https://"”
| biomart | version |
|---|---|
| <chr> | <chr> |
| ENSEMBL_MART_ENSEMBL | Ensembl Genes 108 |
| ENSEMBL_MART_MOUSE | Mouse strains 108 |
| ENSEMBL_MART_SNP | Ensembl Variation 108 |
| ENSEMBL_MART_FUNCGEN | Ensembl Regulation 108 |
Warning message: “Ensembl will soon enforce the use of https. Ensure the 'host' argument includes "https://"”
We now need a GenomicRanges object with the gene annotations.
# Code cell n°41
data(ens.gene.ann.hg18 )
str(ens.gene.ann.hg18)
'data.frame': 21604 obs. of 7 variables: $ Gene : chr "MIRN1302-2" "FAM138E" "FAM138E" "FAM138A" ... $ EnsID : chr "ENSG00000221311" "ENSG00000222027" "ENSG00000222003" "ENSG00000222003" ... $ Chromosome: chr "1" "1" "1" "1" ... $ Start : int 20229 24417 24417 24417 58954 357522 610959 742614 850393 869459 ... $ End : int 20366 25944 25944 25944 59871 358460 611897 745077 869824 884494 ... $ band : chr "p36.33" "p36.33" "p36.33" "p36.33" ... $ strand : int 1 -1 -1 -1 1 1 -1 1 1 -1 ...
For homogeneity with other GR objects, we add "chr" to name chromosomes.
# Code cell n°42
ens.gene.ann.hg18$Chromosome <- paste0("chr", ens.gene.ann.hg18$Chromosome)
head(ens.gene.ann.hg18)
| Gene | EnsID | Chromosome | Start | End | band | strand | |
|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <int> | <int> | <chr> | <int> | |
| 21297 | MIRN1302-2 | ENSG00000221311 | chr1 | 20229 | 20366 | p36.33 | 1 |
| 21 | FAM138E | ENSG00000222027 | chr1 | 24417 | 25944 | p36.33 | -1 |
| 827 | FAM138E | ENSG00000222003 | chr1 | 24417 | 25944 | p36.33 | -1 |
| 828 | FAM138A | ENSG00000222003 | chr1 | 24417 | 25944 | p36.33 | -1 |
| 829 | OR4F5 | ENSG00000177693 | chr1 | 58954 | 59871 | p36.33 | 1 |
| 830 | OR4F29 | ENSG00000177799 | chr1 | 357522 | 358460 | p36.33 | 1 |
# Code cell n°43
ens.gene.ann.hg18 <- ens.gene.ann.hg18[! duplicated(ens.gene.ann.hg18$Gene),]
To read it as a GenomicRanges object, we need to convert the strand encoding.
# Code cell n°44
ens.gene.ann.hg18$strand <- ifelse(ens.gene.ann.hg18$strand == 1, '+', '-')
ens.gene.ann.hg18GR <-makeGRangesFromDataFrame(ens.gene.ann.hg18, keep.extra.columns = TRUE)
str(ens.gene.ann.hg18GR)
Formal class 'GRanges' [package "GenomicRanges"] with 7 slots ..@ seqnames :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 24 levels "chr1","chr2",..: 1 10 11 12 13 14 15 16 17 18 ... .. .. ..@ lengths : int [1:24] 2221 818 1337 1048 360 806 669 812 1205 271 ... .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ ranges :Formal class 'IRanges' [package "IRanges"] with 6 slots .. .. ..@ start : int [1:21090] 20229 24417 24417 58954 357522 610959 742614 850393 869459 885830 ... .. .. ..@ width : int [1:21090] 138 1528 1528 918 939 939 2464 19432 15036 5129 ... .. .. ..@ NAMES : chr [1:21090] "21297" "21" "828" "829" ... .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ strand :Formal class 'Rle' [package "S4Vectors"] with 4 slots .. .. ..@ values : Factor w/ 3 levels "+","-","*": 1 2 1 2 1 2 1 2 1 2 ... .. .. ..@ lengths : int [1:9893] 1 2 2 1 2 1 2 2 2 1 ... .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ seqinfo :Formal class 'Seqinfo' [package "GenomeInfoDb"] with 4 slots .. .. ..@ seqnames : chr [1:24] "chr1" "chr2" "chr3" "chr4" ... .. .. ..@ seqlengths : int [1:24] NA NA NA NA NA NA NA NA NA NA ... .. .. ..@ is_circular: logi [1:24] NA NA NA NA NA NA ... .. .. ..@ genome : chr [1:24] NA NA NA NA ... ..@ elementMetadata:Formal class 'DFrame' [package "S4Vectors"] with 6 slots .. .. ..@ rownames : NULL .. .. ..@ nrows : int 21090 .. .. ..@ listData :List of 3 .. .. .. ..$ Gene : chr [1:21090] "MIRN1302-2" "FAM138E" "FAM138A" "OR4F5" ... .. .. .. ..$ EnsID: chr [1:21090] "ENSG00000221311" "ENSG00000222027" "ENSG00000222003" "ENSG00000177693" ... .. .. .. ..$ band : chr [1:21090] "p36.33" "p36.33" "p36.33" "p36.33" ... .. .. ..@ elementType : chr "ANY" .. .. ..@ elementMetadata: NULL .. .. ..@ metadata : list() ..@ elementType : chr "ANY" ..@ metadata : list()
# Code cell n°45
head(ens.gene.ann.hg18GR)
GRanges object with 6 ranges and 3 metadata columns:
seqnames ranges strand | Gene EnsID band
<Rle> <IRanges> <Rle> | <character> <character> <character>
21297 chr1 20229-20366 + | MIRN1302-2 ENSG00000221311 p36.33
21 chr1 24417-25944 - | FAM138E ENSG00000222027 p36.33
828 chr1 24417-25944 - | FAM138A ENSG00000222003 p36.33
829 chr1 58954-59871 + | OR4F5 ENSG00000177693 p36.33
830 chr1 357522-358460 + | OR4F29 ENSG00000177799 p36.33
831 chr1 610959-611897 - | OR4F16 ENSG00000185097 p36.33
-------
seqinfo: 24 sequences from an unspecified genome; no seqlengths
It is worth noting that some genes are not present in our DE object:
# Code cell n°46
length(intersect(ens.gene.ann.hg18$Gene, DE$TargetID))
length(setdiff(ens.gene.ann.hg18$Gene, DE$TargetID))
length(setdiff(DE$TargetID, ens.gene.ann.hg18$Gene))
# Code cell n°47
suppressWarnings(ens.gene.ann.hg38 <- select(org.Hs.eg.db,
keys = keys(org.Hs.egENSEMBL2EG),
columns=c("ENSEMBL","ENTREZID","SYMBOL","GENENAME", "CHRLOC","CHRLOCEND"),
keytype="ENSEMBL"))
head(ens.gene.ann.hg38)
'select()' returned 1:many mapping between keys and columns
| ENSEMBL | ENTREZID | SYMBOL | GENENAME | CHRLOC | CHRLOCCHR | CHRLOCEND | |
|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <int> | <chr> | <int> | |
| 1 | ENSG00000121410 | 1 | A1BG | alpha-1-B glycoprotein | -58345182 | 19 | -58353492 |
| 2 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | -9067707 | 12 | -9115919 |
| 3 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | -9067707 | 12 | -9116229 |
| 4 | ENSG00000256069 | 3 | A2MP1 | alpha-2-macroglobulin pseudogene 1 | -9228532 | 12 | -9234207 |
| 5 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | 18210108 | 8 | 18223689 |
| 6 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | 18170466 | 8 | 18223689 |
We rename the chromosome, start and end columns to be read by GenomicRanges and we also generate a strand column based on the sign of the geomic coordinates.
# Code cell n°48
names(ens.gene.ann.hg38)[5:7] <- c("start", "chr", "end")
ens.gene.ann.hg38$strand <- ifelse(ens.gene.ann.hg38$start > 0 , '+', '-')
ens.gene.ann.hg38$start <- abs(ens.gene.ann.hg38$start)
ens.gene.ann.hg38$end <- abs(ens.gene.ann.hg38$end)
ens.gene.ann.hg38 <- ens.gene.ann.hg38[,c(1:4,6,5,7:8)]
head(ens.gene.ann.hg38)
| ENSEMBL | ENTREZID | SYMBOL | GENENAME | chr | start | end | strand | |
|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | |
| 1 | ENSG00000121410 | 1 | A1BG | alpha-1-B glycoprotein | 19 | 58345182 | 58353492 | - |
| 2 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | 12 | 9067707 | 9115919 | - |
| 3 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | 12 | 9067707 | 9116229 | - |
| 4 | ENSG00000256069 | 3 | A2MP1 | alpha-2-macroglobulin pseudogene 1 | 12 | 9228532 | 9234207 | - |
| 5 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | 8 | 18210108 | 18223689 | + |
| 6 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | 8 | 18170466 | 18223689 | + |
Some genes have no coordinates, so we remove them.
# Code cell n°49
ens.gene.ann.hg38 <- subset(ens.gene.ann.hg38, ! is.na(chr))
For homogeneity with other GR objects, we add "chr" to name chromosomes.
# Code cell n°50
ens.gene.ann.hg38$chr <- paste0("chr", ens.gene.ann.hg38$chr)
head(ens.gene.ann.hg38)
| ENSEMBL | ENTREZID | SYMBOL | GENENAME | chr | start | end | strand | |
|---|---|---|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | |
| 1 | ENSG00000121410 | 1 | A1BG | alpha-1-B glycoprotein | chr19 | 58345182 | 58353492 | - |
| 2 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | chr12 | 9067707 | 9115919 | - |
| 3 | ENSG00000175899 | 2 | A2M | alpha-2-macroglobulin | chr12 | 9067707 | 9116229 | - |
| 4 | ENSG00000256069 | 3 | A2MP1 | alpha-2-macroglobulin pseudogene 1 | chr12 | 9228532 | 9234207 | - |
| 5 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | chr8 | 18210108 | 18223689 | + |
| 6 | ENSG00000171428 | 9 | NAT1 | N-acetyltransferase 1 | chr8 | 18170466 | 18223689 | + |
# Code cell n°51
ens.gene.ann.hg38 <- ens.gene.ann.hg38[! duplicated(ens.gene.ann.hg38$SYMBOL),]
# Code cell n°52
ens.gene.ann.hg38GR <- makeGRangesFromDataFrame(ens.gene.ann.hg38, keep.extra.columns = TRUE)
head(ens.gene.ann.hg38GR)
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | ENSEMBL ENTREZID SYMBOL GENENAME
<Rle> <IRanges> <Rle> | <character> <character> <character> <character>
1 chr19 58345182-58353492 - | ENSG00000121410 1 A1BG alpha-1-B glycoprotein
2 chr12 9067707-9115919 - | ENSG00000175899 2 A2M alpha-2-macroglobulin
4 chr12 9228532-9234207 - | ENSG00000256069 3 A2MP1 alpha-2-macroglobuli..
5 chr8 18210108-18223689 + | ENSG00000171428 9 NAT1 N-acetyltransferase 1
9 chr8 18391281-18401215 + | ENSG00000156006 10 NAT2 N-acetyltransferase 2
10 chr14 94612390-94624053 + | ENSG00000196136 12 SERPINA3 serpin family A memb..
-------
seqinfo: 200 sequences from an unspecified genome; no seqlengths
Again, we have differences with some genes that are not present in our DE object and vice versa.
# Code cell n°53
length(intersect(ens.gene.ann.hg38$SYMBOL, DE$TargetID))
length(setdiff(ens.gene.ann.hg38$SYMBOL, DE$TargetID))
length(setdiff(DE$TargetID, ens.gene.ann.hg38$SYMBOL))
The distanceToNearest() function of GenomicRanges directly returns the nearest features (present in a subject dataset, here the genes) from a query (here the peaks) and computes the distances between pairwise elements.
# Code cell n°54
dists_unstim_hg18 <- suppressWarnings(distanceToNearest(STAT1_unstim_peaks_hg18GR, ens.gene.ann.hg18GR))
head(dists_unstim_hg18)
Hits object with 6 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 2226 | 0
[2] 2 2231 | 240852
[3] 3 2232 | 602716
[4] 4 2232 | 23615
[5] 5 2232 | 0
[6] 6 2234 | 39853
-------
queryLength: 11004 / subjectLength: 21090
# Code cell n°55
dists_unstim_hg38 <- suppressWarnings(distanceToNearest(STAT1_unstim_peaks_hg38GR, ens.gene.ann.hg38GR))
head(dists_unstim_hg38)
Hits object with 6 hits and 1 metadata column:
queryHits subjectHits | distance
<integer> <integer> | <integer>
[1] 1 8186 | 0
[2] 2 24912 | 96789
[3] 3 24912 | 388208
[4] 4 3514 | 25255
[5] 5 3514 | 0
[6] 6 25604 | 8200
-------
queryLength: 11242 / subjectLength: 25970
# Code cell n°56
par(mfrow = c(2,1))
dist2plot <- mcols(dists_unstim_hg18)[,1]
hist(log10(dist2plot), xlab = "log10(dist to nearest gene on hg18) of peaks without stimulation",
main = "Distances")
dist2plot <- mcols(dists_unstim_hg38)[,1]
hist(log10(dist2plot), xlab = "log10(dist to nearest geneon hg38) of peaks without stimulation",
main = "Distances")
# Code cell n°57
dists_ifng_hg18 <- suppressWarnings(distanceToNearest(STAT1_ifng_peaks_hg18GR, ens.gene.ann.hg18GR))
dists_ifng_hg38 <- suppressWarnings(distanceToNearest(STAT1_ifng_peaks_hg38GR, ens.gene.ann.hg38GR))
# Code cell n°58
par(mfrow = c(2,1))
dist2plot <- mcols(dists_ifng_hg18)[,1]
hist(log10(dist2plot),xlab = "log10(dist to nearest gene on hg18) of peaks with IFNG stimulation",
main = "Distances")
dist2plot <- mcols(dists_ifng_hg38)[,1]
hist(log10(dist2plot),xlab = "log10(dist to nearest geneon hg38)of peaks with IFNG stimulation",
main = "Distances")
How could you statistically compare these distributions? Enter your code in the next 3 cells n°59 to 61.
# Code cell n°59
t.test(mcols(dists_unstim_hg18)[,1], mcols(dists_unstim_hg38)[,1])
Welch Two Sample t-test data: mcols(dists_unstim_hg18)[, 1] and mcols(dists_unstim_hg38)[, 1] t = 3.8363, df = 20340, p-value = 0.0001253 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 4705.042 14535.521 sample estimates: mean of x mean of y 63853.25 54232.96
# Code cell n°60
t.test(mcols(dists_ifng_hg18)[,1], mcols(dists_ifng_hg38)[,1])
Welch Two Sample t-test data: mcols(dists_ifng_hg18)[, 1] and mcols(dists_ifng_hg38)[, 1] t = 15.697, df = 71626, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 14037.57 18043.35 sample estimates: mean of x mean of y 55860.16 39819.70
# Code cell n°61
t.test(mcols(dists_unstim_hg38)[,1], mcols(dists_ifng_hg38)[,1])
Welch Two Sample t-test data: mcols(dists_unstim_hg38)[, 1] and mcols(dists_ifng_hg38)[, 1] t = 8.845, df = 14541, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 11219.17 17607.35 sample estimates: mean of x mean of y 54232.96 39819.70
# Code cell n°62
dists_unstim_hg18_df <- as.data.frame(dists_unstim_hg18)
dists_ifng_hg18_df <- as.data.frame(dists_ifng_hg18)
head(dists_unstim_hg18_df)
head(dists_ifng_hg18_df)
dists_unstim_hg38_df <- as.data.frame(dists_unstim_hg38)
dists_ifng_hg38_df <- as.data.frame(dists_ifng_hg38)
head(dists_unstim_hg38_df)
head(dists_ifng_hg38_df)
| queryHits | subjectHits | distance | |
|---|---|---|---|
| <int> | <int> | <int> | |
| 1 | 1 | 2226 | 0 |
| 2 | 2 | 2231 | 240852 |
| 3 | 3 | 2232 | 602716 |
| 4 | 4 | 2232 | 23615 |
| 5 | 5 | 2232 | 0 |
| 6 | 6 | 2234 | 39853 |
| queryHits | subjectHits | distance | |
|---|---|---|---|
| <int> | <int> | <int> | |
| 1 | 1 | 2222 | 20250 |
| 2 | 2 | 2224 | 8036 |
| 3 | 3 | 2224 | 0 |
| 4 | 4 | 2224 | 3648 |
| 5 | 5 | 2227 | 0 |
| 6 | 6 | 2229 | 0 |
| queryHits | subjectHits | distance | |
|---|---|---|---|
| <int> | <int> | <int> | |
| 1 | 1 | 8186 | 0 |
| 2 | 2 | 24912 | 96789 |
| 3 | 3 | 24912 | 388208 |
| 4 | 4 | 3514 | 25255 |
| 5 | 5 | 3514 | 0 |
| 6 | 6 | 25604 | 8200 |
| queryHits | subjectHits | distance | |
|---|---|---|---|
| <int> | <int> | <int> | |
| 1 | 1 | 7450 | 27485 |
| 2 | 2 | 8025 | 6744 |
| 3 | 3 | 8025 | 0 |
| 4 | 4 | 8025 | 3558 |
| 5 | 5 | 8481 | 0 |
| 6 | 6 | 2347 | 0 |
queryHits index and the subjectHits index. We do that with two consecutive merging.For hg18 we need to add a column "index" since the row names do not match indexes.
# Code cell n°63
ens.gene.ann.hg18$index <- 1:nrow(ens.gene.ann.hg18)
# Code cell n°64
dists_unstim_hg18_df <- merge(dists_unstim_hg18_df, STAT1_unstim_peaks_hg18, by.x = "queryHits", by.y = 0, sort = F, all = T)
dists_unstim_hg18_df<- merge(dists_unstim_hg18_df, ens.gene.ann.hg18, by.x = "subjectHits", by.y = "index", all.x = T, all.y = F, sort = F)
# Code cell n°65
dists_ifng_hg18_df <- merge(dists_ifng_hg18_df, STAT1_ifng_peaks_hg18, by.x = "queryHits", by.y = 0, sort = F, all = T)
dists_ifng_hg18_df<- merge(dists_ifng_hg18_df, ens.gene.ann.hg18, by.x = "subjectHits", by.y = "index", all.x = T, all.y = F, sort = F)
# Code cell n°66
dists_unstim_hg38_df <- merge(dists_unstim_hg38_df, STAT1_unstim_peaks_hg38, by.x = "queryHits", by.y = 0, sort = F, all = T)
dists_unstim_hg38_df <- merge(dists_unstim_hg38_df, ens.gene.ann.hg38, by.x = "subjectHits", by.y = 0, all.x = T, all.y = F, sort = F)
# Code cell n°67
dists_ifng_hg38_df <- merge(dists_ifng_hg38_df, STAT1_ifng_peaks_hg38, by.x = "queryHits", by.y = 0, sort = F, all = T)
dists_ifng_hg38_df <- merge(dists_ifng_hg38_df, ens.gene.ann.hg38, by.x = "subjectHits", by.y = 0, all.x = T, all.y = F, sort = F)
# Code cell n°68
head(dists_unstim_hg18_df)
| subjectHits | queryHits | distance | chr | height | start | end | numtags | max_ht_offset | ID | Gene | EnsID | Chromosome | Start | End | band | strand | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <chr> | <int> | <chr> | <int> | <int> | <int> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | <chr> | |
| 1 | 2226 | 1 | 0 | chr10 | 14 | 967275 | 967866 | 20 | 379 | unstim_1 | LARP5 | ENSG00000107929 | chr10 | 845484 | 967440 | p15.3 | - |
| 2 | 2231 | 2 | 240852 | chr10 | 11 | 2010523 | 2011422 | 28 | 374 | unstim_2 | ADARB2 | ENSG00000185736 | chr10 | 1218073 | 1769670 | p15.3 | - |
| 3 | 2232 | 4 | 23615 | chr10 | 11 | 3075739 | 3076124 | 13 | 130 | unstim_4 | PFKP | ENSG00000067057 | chr10 | 3099740 | 3169762 | p15.2 | + |
| 4 | 2232 | 5 | 0 | chr10 | 12 | 3122866 | 3123854 | 25 | 394 | unstim_5 | PFKP | ENSG00000067057 | chr10 | 3099740 | 3169762 | p15.2 | + |
| 5 | 2232 | 3 | 602716 | chr10 | 12 | 2496490 | 2497023 | 23 | 214 | unstim_3 | PFKP | ENSG00000067057 | chr10 | 3099740 | 3169762 | p15.2 | + |
| 6 | 2234 | 6 | 39853 | chr10 | 11 | 3857309 | 3857940 | 22 | 192 | unstim_6 | KLF6 | ENSG00000067082 | chr10 | 3811260 | 3817455 | p15.1 | - |
# Code cell n°69
dists_unstim_hg18_df %<>% rename(chr.peak = chr, start.peak = start, end.peak = end,
chr.gene = Chromosome, start.gene = Start, end.gene = End)
# Code cell n°70
dists_ifng_hg18_df %<>% rename(chr.peak = chr, start.peak = start, end.peak = end,
chr.gene = Chromosome, start.gene = Start, end.gene = End)
# Code cell n°71
head(dists_unstim_hg38_df)
| subjectHits | queryHits | distance | chr.x | start.x | end.x | ID | height | numtags | max_ht_offset | ENSEMBL | ENTREZID | SYMBOL | GENENAME | chr.y | start.y | end.y | strand | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <int> | <int> | <chr> | <int> | <int> | <chr> | <int> | <int> | <int> | <chr> | <chr> | <chr> | <chr> | <chr> | <int> | <int> | <chr> | |
| 1 | 25101 | 12 | 0 | chr10 | 5595861 | 5596286 | unstim_12 | 12 | 1 | 227 | ENSG00000197079 | 3886 | KRT35 | keratin 35 | chr17 | 41476688 | 41481140 | - |
| 2 | 25734 | 17 | 257389 | chr10 | 8640451 | 8640598 | unstim_17 | 13 | 1 | 0 | ENSG00000197063 | 4097 | MAFG | MAF bZIP transcription factor G | chr17 | 81918269 | 81927735 | - |
| 3 | 4267 | 28 | 0 | chr10 | 16699331 | 16699599 | unstim_28 | 14 | 1 | 121 | ENSG00000164821 | 1669 | DEFA4 | defensin alpha 4 | chr8_KZ208915v1_fix | 330942 | 333429 | - |
| 4 | 3578 | 30 | 44603 | chr10 | 22758712 | 22759008 | unstim_30 | 11 | 1 | 149 | ENSG00000182578 | 1436 | CSF1R | colony stimulating factor 1 receptor | chr5 | 150053290 | 150113372 | - |
| 5 | 4737 | 39 | 149042 | chr10 | 29784014 | 29785135 | unstim_39 | 14 | 3 | 488 | ENSG00000134001 | 1965 | EIF2S1 | eukaryotic translation initiation factor 2 subunit alpha | chr14 | 67360327 | 67386516 | + |
| 6 | 943 | 42 | 20152 | chr10 | 30413909 | 30414492 | unstim_42 | 12 | 1 | 272 | ENSG00000161944 | 433 | ASGR2 | asialoglycoprotein receptor 2 | chr17 | 7101321 | 7114811 | - |
# Code cell n°72
dists_unstim_hg38_df %<>% rename(chr.peak = chr.x, start.peak = start.x, end.peak = end.x,
chr.gene = chr.y, start.gene = start.y, end.gene = end.y)
dists_ifng_hg38_df %<>% rename(chr.peak = chr.x, start.peak = start.x, end.peak = end.x,
chr.gene = chr.y, start.gene = start.y, end.gene = end.y)
Thus we obtain two dataframes with the peak coordinates on the left and the putative target genes on the right.
# Code cell n°73
DE %<>% mutate(stat1_target_unstim_hg18 = TargetID %in% dists_unstim_hg18_df$Gene,
stat1_target_ifng_hg18 = TargetID %in% dists_ifng_hg18_df$Gene,
stat1_target_unstim_hg38 = TargetID %in% dists_unstim_hg38_df$SYMBOL,
stat1_target_ifng_hg38 = TargetID %in% dists_ifng_hg38_df$SYMBOL)
# Code cell n°74
table(DE$stat1_target_ifng_hg38, DE$stat1_target_unstim_hg38)
FALSE TRUE
FALSE 32932 42
TRUE 931 789
# Code cell n°75
DE %>% filter(adj.P.Val < 0.05) %>% count(stat1_target_ifng_hg38, stat1_target_unstim_hg38)
| stat1_target_ifng_hg38 | stat1_target_unstim_hg38 | n |
|---|---|---|
| <lgl> | <lgl> | <int> |
| FALSE | FALSE | 415 |
| TRUE | FALSE | 27 |
| TRUE | TRUE | 21 |
You could go further by performing an overrepresentation test of DE genes among the set of genes that are putative targets for STAT1 (cf. session 1).
# Code cell n°76
DE %>% group_by(stat1_target_unstim_hg38) %>%
summarise(counts = n(),
median = median(logFC),
Q1 = quantile(logFC, probs=0.25),
Q3 = quantile(logFC, probs=0.75),
mean = mean(logFC),
mean_ci_low = mean_ci(logFC)$ymin ,
mean_ci_high = mean_ci(logFC)$ymax,
std = sd(logFC))
| stat1_target_unstim_hg38 | counts | median | Q1 | Q3 | mean | mean_ci_low | mean_ci_high | std |
|---|---|---|---|---|---|---|---|---|
| <lgl> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| FALSE | 33863 | -0.000699005 | -0.06331075 | 0.06222000 | -0.0002311149 | -0.001965064 | 0.001502834 | 0.1627929 |
| TRUE | 831 | -0.012559443 | -0.10401132 | 0.07127871 | -0.0317904550 | -0.048031524 | -0.015549386 | 0.2385247 |
# Code cell n°77
DE %>% group_by(stat1_target_ifng_hg38) %>%
summarise(counts = n(),
median = median(logFC),
Q1 = quantile(logFC, probs=0.25),
Q3 = quantile(logFC, probs=0.75),
mean = mean(logFC),
mean_ci_low = mean_ci(logFC)$ymin ,
mean_ci_high = mean_ci(logFC)$ymax,
std = sd(logFC))
| stat1_target_ifng_hg38 | counts | median | Q1 | Q3 | mean | mean_ci_low | mean_ci_high | std |
|---|---|---|---|---|---|---|---|---|
| <lgl> | <int> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> | <dbl> |
| FALSE | 32974 | -0.0004782265 | -0.06287669 | 0.06211007 | 0.0004298819 | -0.00130122 | 0.002160984 | 0.1603779 |
| TRUE | 1720 | -0.0126143421 | -0.09655945 | 0.07195766 | -0.0281506035 | -0.03932599 | -0.016975222 | 0.2363048 |
# Code cell n°78
plot1 <- DE %>% filter(adj.P.Val < 0.05) %>% ggplot(aes(x = stat1_target_unstim_hg38, y = logFC, color = stat1_target_unstim_hg38)) +
geom_violin() +
geom_boxplot(width = 0.05, col = "black", outlier.shape = NA) +
geom_jitter() +
stat_compare_means(method = "t.test", label.x = 1.5, label.y = 2.5) +
stat_compare_means( label.x = 1.5)
# Code cell n°79
plot2 <- DE %>% filter(adj.P.Val < 0.05) %>% ggplot(aes(x = stat1_target_ifng_hg38, y = logFC, color = stat1_target_ifng_hg38)) +
geom_violin() +
geom_boxplot(width = 0.05, col = "black", outlier.shape = NA) +
geom_jitter() +
stat_compare_means(method = "t.test", label.x = 1.5, label.y = 2.5) +
stat_compare_means( label.x = 1.5)
# Code cell n°80
options(repr.plot.width = 20, repr.plot.height = 10)
ggarrange(plot1, plot2)
=> Explain in details the above commands in cells 78-80.
To answer that question, we will look at the correlation between peaks heights and average gene expression.
# Code cell n°81
str(dists_unstim_hg18_df)
'data.frame': 11004 obs. of 17 variables: $ subjectHits : int 2226 2231 2232 2232 2232 2234 2235 2235 2235 2243 ... $ queryHits : chr "1" "2" "4" "5" ... $ distance : int 0 240852 23615 0 602716 39853 121496 1595 481039 0 ... $ chr.peak : chr "chr10" "chr10" "chr10" "chr10" ... $ height : int 14 11 11 12 12 11 14 11 13 1935 ... $ start.peak : int 967275 2010523 3075739 3122866 2496490 3857309 4736305 4881845 4376424 5461184 ... $ end.peak : int 967866 2011422 3076124 3123854 2497023 3857940 4736905 4882581 4377362 5461374 ... $ numtags : int 20 28 13 25 23 22 22 22 26 1936 ... $ max_ht_offset: int 379 374 130 394 214 192 226 150 349 43 ... $ ID : chr "unstim_1" "unstim_2" "unstim_4" "unstim_5" ... $ Gene : chr "LARP5" "ADARB2" "PFKP" "PFKP" ... $ EnsID : chr "ENSG00000107929" "ENSG00000185736" "ENSG00000067057" "ENSG00000067057" ... $ chr.gene : chr "chr10" "chr10" "chr10" "chr10" ... $ start.gene : int 845484 1218073 3099740 3099740 3099740 3811260 4858402 4858402 4858402 5444518 ... $ end.gene : int 967440 1769670 3169762 3169762 3169762 3817455 4880249 4880249 4880249 5490401 ... $ band : chr "p15.3" "p15.3" "p15.2" "p15.2" ... $ strand : chr "-" "-" "+" "+" ...
# Code cell n°82
dists_unstim_hg18_df %<>%
left_join(DE %>% dplyr::select(TargetID, logFC, AveExpr, adj.P.Val) ,
by = c("Gene" = "TargetID"))
dists_ifng_hg18_df %<>%
left_join(DE %>% dplyr::select(TargetID, logFC, AveExpr, adj.P.Val) ,
by = c("Gene" = "TargetID"))
# Code cell n°83
dists_unstim_hg18_df %>% filter(adj.P.Val < 0.05) %>% ggplot(aes(x = height, y = AveExpr)) + geom_point()
# Code cell n°84
myplot <- ggscatter(dists_unstim_hg18_df %>% filter(adj.P.Val < 0.005 & height < 50), x = "height", y = "AveExpr",
alpha=0.5,
size=1,
add = "reg.line",
conf.int = TRUE,
add.params = list(color = "blue",
fill = "lightgray")
) +
stat_cor(method = "spearman") +
ggtitle("Pearson correlation between ")
myplot + geom_text_repel(aes(label = Gene), size = 3)
`geom_smooth()` using formula = 'y ~ x'
To further explore these questions, you can exploit these ressources:
Epimap, Epigenome Integration across Multiple Annotation Projects: http://compbio.mit.edu/epimap/
TADKB, an integrated resource for exploring topologically associating domains (TADs): http://dna.cs.miami.edu/TADKB/
netZoo, a catalog of gene regulatory network inference and analysis methods: https://netzoo.github.io/
GSDB, a database of three-dimensional (3D) chromosome and genome structures reconstructed from Hi-C data: http://sysbio.rnet.missouri.edu/3dgenome/GSDB/index.php (obsolete link...)
[version 12/12/2022 - last revision:@Scaburet]